B.A.,  ST.  MARY’S  COLLEGE  OF  MARYLAND
M.S.,  UNIVERSITY  OF  MASSACHUSETTS  AMHERST 
Ph.D.,  UNIVERSITY  OF  MASSACHUSETTS  AMHERST 
Many applications require predicting not just a single variable, but multiple
variables that depend on each other. Recent attention has therefore focused on
structured prediction methods, which combine the modeling flexibility of graphical
models with the ability to employ complex, dependent features typical of traditional
classification methods. Especially popular have been conditional random fields (CRFs),
which are graphical models of the conditional distribution over outputs given a set
of observed features. Unfortunately, parameter estimation in CRFs requires repeated
inference, which can be computationally expensive. Complex graphical structures are
increasingly desired in practical applications, but then training time often becomes
prohibitive.

In this thesis, I investigate efficient training methods for conditional random fields
with complex graphical structure, focusing on local methods which avoid propagating
information globally along the graph. First, I investigate piecewise training, which
trains each of a model’s factors separately. I present three views of piecewise
training: as maximizing the likelihood in a so-called “node-split graph”, as maximizing
the Bethe likelihood with uniform messages, and as generalizing the pseudo-moment
matching estimator of Wainwright, Jaakkola, and Willsky. Second, I propose piecewise
pseudolikelihood, a hybrid procedure which “pseudolikelihood-izes” the piecewise
likelihood, and is therefore more efficient if the variables have large cardinality.
Piecewise pseudolikelihood performs well even on applications in which standard
pseudolikelihood performs poorly. Finally, motivated by the connection between piecewise
training and belief propagation (BP), I explore training methods using beliefs arising
from stopping BP before convergence. I propose a new schedule for message propagation
that improves upon the dynamic schedule proposed recently by Elidan, McGraw, and Koller,
and present suggestive results applying dynamic schedules to the system of equations that
combine inference and learning.

I also present two novel families of loopy CRFs, which appear as test cases throughout.
First is the dynamic CRF, which combines the factorized state representation of
dynamic Bayesian networks with the modeling flexibility of conditional models. The
second of these is the skip-chain CRF, which models the fact that identical words are
likely to have the same label, even if they occur far apart.
CHAPTER  1


INTRODUCTION 


Fundamental to many applications is the ability to predict multiple variables that
depend on each other. Such applications are as diverse as classifying regions of an
image [61], estimating the score in a game of Go [116], segmenting genes in a strand
of DNA [5], and extracting syntax from natural-language text [130]. In such applications,
we wish to predict a vector $y = \{y_0, y_1, \ldots, y_T\}$ of random variables given an
observed feature vector $x$. A relatively simple example from natural-language
processing is part-of-speech tagging, in which each variable $y_s$ is the part-of-speech
tag of the word at position $s$, and the input $x$ is divided into feature vectors
$\{x_0, x_1, \ldots, x_T\}$. Each $x_s$ contains various information about the word at
position $s$, such as its identity, orthographic features such as prefixes and suffixes,
membership in domain-specific lexicons, and information in semantic databases such as
WordNet.

One approach to this multivariate prediction problem, especially if our goal is to
maximize the number of $y_s$ that are correctly classified, is to learn a per-position
classifier that maps $x \mapsto y_s$ for each $s$. There are two difficulties with this method,
however. The first is that we are not always interested in maximizing the number
of correct predictions. Sometimes the objective may be to
maximize the probability that the entire sequence is correct, or to maximize a more
complicated function like BLEU or $F_1$. The second difficulty is that the size of both
the input and output vectors can be extremely large. For example, in part-of-speech
tagging, each vector $x_s$ may have tens of thousands of components, so a classifier
based on all of $x$ would have many parameters. But using only $x_s$ to predict $y_s$ is also
bad, because information from neighboring feature vectors is also useful in making
predictions. Both of these difficulties can be addressed by explicitly modeling the
dependence between outputs, so that a confident prediction about one variable
can influence nearby, possibly less confident predictions.

A  natural  way  to  represent  the  manner  in  which  variables  depend  on  each  other  is 
provided  by  graphical  models.  Graphical  models — which  include  such  diverse  model 
families  as  Bayesian  networks,  neural  networks,  factor  graphs,  Markov  random  fields, 
Ising  models,  and  others — represent  a  complex  distribution  over  many  variables  as  a 
product  of  local  factors  on  smaller  subsets  of  variables.  It  is  then  possible  to  describe 
how  a  given  factorization  of  the  probability  density  corresponds  to  a  particular  set 
of  conditional  independence  relationships  satisfied  by  the  distribution.  This
correspondence  makes  modeling  much  more  convenient,  because  often  our  knowledge  of
the  domain  suggests  reasonable  conditional  independence  assumptions,  which  then 
determine  our  choice  of  factors. 

Much  of  the  early  work  in  learning  with  graphical  models,  especially  in  statistical 
natural-language  processing,  focused  on  generative  models  that  explicitly  attempted 
to  model  a  joint  probability  distribution  over  inputs  and  outputs.  Although  there 
are  advantages  to  this  approach,  it  also  has  important  limitations.  Not  only  can  the 
dimensionality  of  x  become  very  large,  but  the  features  have  complex  dependencies, 
so  constructing  a  probability  distribution  over  them  is  difficult.  In  practice,  one 
must  either  employ  a  complex  model  of  the  features,  which  is  intractable,  or  make 
strong  independence  assumptions  among  the  features,  which  can  lead  to  reduced 
performance. 

An alternative approach is to predict $y$ directly, without modeling $x$. This is the
idea behind structured prediction. Structured prediction methods are essentially a
combination of classification and graphical modeling, combining the ability to
compactly model multivariate data with the ability to perform prediction using large sets
of input features. The idea is, for an input $x$, to define a discriminant function $F_x(y)$,
and predict $y^* = \arg\max_y F_x(y)$. This function factorizes according to a set of local
factors, just as in graphical models. But as in classification, each local factor is
modeled as a linear function of $x$, although perhaps in some induced high-dimensional space.
To understand the benefits of this approach, consider a hidden Markov model
(formally introduced in Section 2.2.2) and a set of per-position classifiers, both with fixed
parameters. In principle, the per-position classifiers predict an output $y_s$ given all of
$x_0 \ldots x_T$.¹ In the HMM, on the other hand, to predict $y_s$ it is statistically sufficient
to know only the local input $x_s$, the previous forward message $p(y_{s-1}, x_0 \ldots x_{s-1})$,
and the backward message $p(x_{s+1} \ldots x_T \mid y_s)$. So the forward and backward messages
serve as a summary of the rest of the input, a summary that is generally non-linear
in the observed features.
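The role of the two messages can be checked numerically. The following is a minimal sketch on a toy HMM; the parameters `PI`, `A`, `B` and the function names are illustrative assumptions, not taken from this thesis. It computes the posterior over $y_s$ from the previous forward message, the local input $x_s$, and the backward message, and compares it with brute-force enumeration over all state sequences.

```python
import itertools

# Toy HMM with 2 states and 3 observation symbols (all numbers are made up).
PI = [0.6, 0.4]                           # PI[j] = p(y_0 = j)
A = [[0.7, 0.3], [0.2, 0.8]]              # A[i][j] = p(y_s = j | y_{s-1} = i)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]    # B[j][o] = p(x_s = o | y_s = j)
STATES = range(len(PI))

def forward(x):
    """alpha[s][j] = p(y_s = j, x_0 ... x_s)."""
    alpha = [[PI[j] * B[j][x[0]] for j in STATES]]
    for s in range(1, len(x)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in STATES) * B[j][x[s]]
                      for j in STATES])
    return alpha

def backward(x):
    """beta[s][i] = p(x_{s+1} ... x_T | y_s = i)."""
    beta = [[1.0 for _ in STATES] for _ in x]
    for s in range(len(x) - 2, -1, -1):
        beta[s] = [sum(A[i][j] * B[j][x[s + 1]] * beta[s + 1][j] for j in STATES)
                   for i in STATES]
    return beta

def posterior_from_messages(x, s):
    """p(y_s | x) from the previous forward message, x_s, and the backward message."""
    alpha, beta = forward(x), backward(x)
    if s == 0:
        prior = PI
    else:
        prior = [sum(alpha[s - 1][i] * A[i][j] for i in STATES) for j in STATES]
    unnorm = [prior[j] * B[j][x[s]] * beta[s][j] for j in STATES]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def posterior_brute_force(x, s):
    """p(y_s | x) by summing the joint over every state sequence."""
    unnorm = [0.0 for _ in STATES]
    for y in itertools.product(STATES, repeat=len(x)):
        p = PI[y[0]] * B[y[0]][x[0]]
        for t in range(1, len(x)):
            p *= A[y[t - 1]][y[t]] * B[y[t]][x[t]]
        unnorm[y[s]] += p
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

The two functions agree at every position, although `posterior_from_messages` only looks at one observation directly; everything else reaches it through the two message vectors.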

1.1  Approaches  to  Structured  Prediction 

Several types of structured prediction algorithms have been studied. All such
algorithms assume that the discriminant function $F_x(y)$ over labels can be written
as a sum of local functions $F_x(y) = \sum_a f_a(y_a, x, \theta)$. The task is to estimate the
real-valued parameter vector $\theta$ given a training set
$\mathcal{D} = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$. The methods
differ in how the parameters are selected.

In this thesis, I focus on conditional random fields (CRFs) [58], in which the
score $F_x(y)$ is viewed as defining a conditional probability distribution
$p(y \mid x) \propto \exp\{F_x(y)\}$. As we see in detail in Chapter 2, training requires
computing marginal distributions, which are intractable in general. The task of dealing
with this intractability will be the main focus of this thesis. Perhaps the main advantage
of probabilistic methods is that they can incorporate latent variables in a natural way, by
marginalization. A particularly powerful example of this is provided by Bayesian
methods, in which the model parameters themselves are integrated out.

¹ To be fair, in practice the classifier for $y_s$ would probably depend only on a sliding window
around $x_s$, rather than all of $x$. But still the structured approach has the advantage that the
forward and backward messages serve as a flexible, nonlinear summary of the surrounding input.
This has the effect of choosing an effective window size from the data.
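To make the relation between the score and the conditional distribution concrete, here is a minimal sketch for a two-label chain. The label set, the feature names, and the weights in `THETA` are hypothetical values chosen for illustration. Both the prediction and the normalizer are computed by exhaustive enumeration, which is only feasible for tiny inputs; it is exactly this sum over all outputs that becomes intractable for large models.

```python
import itertools
import math

LABELS = ["N", "V"]

# Hypothetical weights theta, keyed by feature name (illustrative values only).
THETA = {
    ("word", "dogs", "N"): 1.5,
    ("word", "bark", "V"): 1.2,
    ("word", "bark", "N"): 0.4,
    ("trans", "N", "V"): 0.8,
    ("trans", "N", "N"): -0.3,
}

def score(y, x, theta=THETA):
    """F_x(y): a sum of local functions over nodes and adjacent label pairs."""
    total = sum(theta.get(("word", x[s], y[s]), 0.0) for s in range(len(x)))
    total += sum(theta.get(("trans", y[s - 1], y[s]), 0.0) for s in range(1, len(x)))
    return total

def predict(x):
    """y* = argmax_y F_x(y), by exhaustive enumeration (tiny inputs only)."""
    return max(itertools.product(LABELS, repeat=len(x)), key=lambda y: score(y, x))

def conditional(x):
    """p(y | x) proportional to exp{F_x(y)}, normalized by brute force."""
    weights = {y: math.exp(score(y, x))
               for y in itertools.product(LABELS, repeat=len(x))}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}
```

Because the exponential is monotone, the most probable output under `conditional` is always the output returned by `predict`.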

Alternative structured prediction methods are based on maximizing over assignments
rather than marginalizing. Perhaps the most popular of these have
been the maximum-margin methods that are so successful for univariate classification.
Recently max-margin methods have been generalized to the structured case [3, 129].
Both batch and online algorithms are available to maximize this objective function.
The perceptron update can also be generalized to structured models [22]. The resulting
algorithm is particularly appealing because it is little more difficult to implement
than the algorithm for selecting $y^*$. The online perceptron update can also be made
margin-aware, yielding the MIRA algorithm [25], which may perform better than the
perceptron update.
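The claim that the structured perceptron needs little beyond the prediction routine can be seen in a short sketch. This shows the usual form of the update (add the features of the true output, subtract those of the predicted one) and is not necessarily the exact variant of [22]; the feature templates, the label set, and the exhaustive `predict` are assumptions made for illustration.

```python
import itertools
from collections import Counter

LABELS = ["N", "V"]

def features(x, y):
    """Global feature vector: counts of word/label and label/label features."""
    phi = Counter()
    for s in range(len(x)):
        phi[("word", x[s], y[s])] += 1
        if s > 0:
            phi[("trans", y[s - 1], y[s])] += 1
    return phi

def score(theta, x, y):
    """Linear score of output y for input x under weights theta."""
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())

def predict(theta, x):
    """y* by exhaustive search; a real tagger would use Viterbi instead."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda y: score(theta, x, y))

def perceptron_train(data, epochs=5):
    """Structured perceptron: update only when the prediction is wrong."""
    theta = {}
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(theta, x)
            if y_hat != tuple(y):
                # Move weights toward the true output and away from the prediction.
                for k, v in features(x, y).items():
                    theta[k] = theta.get(k, 0.0) + v
                for k, v in features(x, y_hat).items():
                    theta[k] = theta.get(k, 0.0) - v
    return theta
```

The training loop calls nothing but `predict` and `features`, so any model for which $y^*$ can be computed can be trained this way.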

Another class of methods is search-based methods [28], in which a heuristic search
procedure over outputs is assumed, and a classifier is learned that predicts the next step
in the search. This has the advantage of fitting in nicely to many problems that are
complex enough to require performing search. It is also able to incorporate arbitrary
loss functions over predictions.

A  general  advantage  of  all  of  these  maximization-based  methods  is  that  they 
do  not  require  summation  over  all  configurations  for  the  partition  function  or  for 
marginal  distributions.  There  are  certain  combinatorial  problems,  such  as  matching 
and  network  flow  problems,  in  which  finding  an  optimal  configuration  is  tractable, 
but  summing  over  configurations  is  not  (for  an  example  of  applying  max-margin 
methods  in  such  situations,  see  Taskar  et  al.  [131]).  For  more  complex  problems, 
neither  summation  nor  maximization  is  tractable,  so  this  advantage  is  perhaps  not 
as  significant.  Another  advantage  of  these  methods  is  that  kernels  can  be  naturally 
incorporated. 
Finally, LeCun et al. [60] generalize many prediction methods, including the ones
listed above, under the rubric of energy-based methods, and present interesting
historical information about their use. They advocate changing the loss function to avoid
probabilities altogether, and so their work may serve as an interesting complement
to my work in this thesis.

An  older  method  for  predicting  multiple  outputs  simultaneously  is  neural
networks.  The  main  difference  between  the  more  recent  structured  prediction  work  and
neural  networks  is  that  neural  networks  represent  the  dependence  between  output 
variables  using  a  shared  latent  representation,  while  structured  methods  learn  these 
dependences  as  direct  functions  of  the  output  variables.  Therefore,  the  main  insight 
of  structured  models  can  be  expressed  in  the  language  of  neural  networks  as:  If  you 
add  connections  among  the  nodes  in  the  output  layer,  then  in  some  problems  you 
do  not  need  a  hidden  layer  to  get  good  performance.  Omitting  the  hidden  layer  has 
great  computational  advantages,  because  the  objective  functions  used  for  training 
become  convex,  and  we  do  not  need  to  worry  about  the  problems  of  local  minima 
that  arise  when  training  neural  networks.  For  harder  problems,  however,  one  might 
expect  that  even  after  modeling  output  structure,  incorporating  hidden  state  will  still 
yield  additional  benefit.  Once  hidden  state  is  introduced  into  the  model,  whether  it 
be  a  neural  network  or  a  structured  model,  the  loss  of  convexity  is  inevitable.  There 
are  currently  few  examples  of  structured  models  with  latent  variables  (for  exceptions, 
see  Quattoni  et  al.  [97]  and  McCallum  et  al.  [73]),  but  it  is  likely  that  such  models 
will  become  more  important  in  the  future. 

The  differences  between  the  various  structured  prediction  methods  are  not  well 
understood.  For  example,  I  am  not  aware  of  any  empirical  study  that  compares
these  algorithms  on  a  broad  range  of  data  sets.  Indeed,  my  presentation  in  this 
section  is  motivated  by  the  view  that  the  similarities  between  various  structured 
prediction  methods  are  more  important  than  the  differences.  For  this  reason,  my 
choice  in  this  thesis  to  focus  on  conditional  random  fields  is  difficult  to  justify,  but
perhaps  of  secondary  importance.  As  I  explain  next,  my  main  area  of  interest  is  to 
incorporate  approximate  inference  algorithms  into  training  of  models  with  intractable 
structure,  and  most  structured  prediction  methods  do  require  performing  inference 
during  training. 

1.2  Efficient  Training  of  CRFs 

Early  work  on  CRFs  focused  on  the  case  in  which  the  variables  y  are  arranged 
in  a  linear  chain,  because  this  choice  allows  the  marginals  of  the  distribution  to  be 
computed  exactly  using  the  forward-backward  algorithm,  and  because  this  choice 
is  very  natural  for  certain  tasks  such  as  information  extraction  and  shallow  parsing. 
Recently,  however,  research  in  NLP  has  begun  to  explore  global  models,  which  exploit 
long-distance  dependencies  between  words  to  improve  performance  [14,  32,  118].  With 
such  rich  models,  however,  a  major  difficulty  with  CRFs  becomes  the  amount  of 
computation  required  for  training.  This  is  because  the  likelihood  gradient  requires 
matching  the  marginal  distributions  of  the  model  to  those  of  the  training  data,  and 
computing  the  model  marginals  is  intractable  for  general  graphs.  In  addition,  the 
amount  of  training  data  can  be  somewhat  large,  with  the  largest  data  sets  reaching 
one  million  words  of  labeled  training  data.  The  standard  approach  for  dealing  with 
this  is  to  approximate  the  model  marginals,  most  commonly  using  variational  methods 
such  as  belief  propagation,  although  many  other  methods  are  also  possible. 
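For reference, the gradient mentioned above has the standard exponential-family form. The sketch below assumes, purely for notation, that each local function is linear in untied parameters, $f_a(y_a, x, \theta) = \sum_k \theta_{ak} f_{ak}(y_a, x)$. The first term in the bracket is the value of a feature on the training data, and the second is its expectation under the model marginal $p(y_a \mid x)$; the gradient vanishes exactly when the two match, and it is the second term that requires inference.

```latex
% Assumes each local function is linear in the parameters:
%   f_a(y_a, x, \theta) = \sum_k \theta_{ak} f_{ak}(y_a, x).
\ell(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta),
\qquad
\frac{\partial \ell}{\partial \theta_{ak}}
  = \sum_{i=1}^{N} \Big[ f_{ak}(y_a^{(i)}, x^{(i)})
  - \sum_{y'_a} p(y'_a \mid x^{(i)}; \theta)\, f_{ak}(y'_a, x^{(i)}) \Big].
```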

Because  inference  in  graphical  models  is  intractable,  there  is  a  vast  literature 
on  approximate  inference.  Here  I  mention  a  few  recent  studies  that  are  relevant  to 
efficient  training  methods  in  undirected  models  and  in  conditional  random  fields. 

For  generative  models,  Abbeel  et  al.  [1]  present  parameter  estimation  and  structure 
learning  algorithms  that,  unlike  maximum  likelihood,  require  polynomial  time  and 
have  polynomial  sample  complexity.  These  algorithms  are  not  yet  practical,  however,
because  they  make  poor  use  of  the  training  data. 

Remarkably,  the  pseudo-moment  matching  estimator  of  Wainwright  et  al.  [142] 
computes  a  parameter  setting  that  maximizes  the  BP  approximation  to  the  likelihood 
without  ever  requiring  any  of  the  message  updates  to  be  computed.  This  estimator 
appears  to  have  very  limited  practical  applicability,  however,  because  it  applies  only 
to  generative  random  fields  with  a  special  parameterization  that  does  not  allow  tied 
parameters.  In  this  thesis  I  present  a  generalization  that  handles  these  issues
(Section  4.1.3).

Although  the  original  work  on  CRFs  used  iterative  scaling,  it  was  later  found  that 
second-order  gradient-based  methods  converged  much  faster  [112].  Recently,  however, 
online  methods  have  been  shown  to  converge  much  faster  than  second-order  methods 
for  CRFs  [136].  Globerson  et  al.  [42]  present  a  particularly  interesting  method  using 
the  dual  of  the  likelihood,  provably  finding  the  minimum  if  the  step  size  is  small 
enough,  but  with  an  online-like  update.  The  difference  between  these  methods  and 
my  work  is  that  they  focus  on  the  case  where  there  are  many  iid  training  instances, 
not  when  there  is  intractable  structure.  In  cases  where  there  are  both  many  training 
examples  and  intractable  structure,  then  one  of  these  online-style  updates  could  be 
incorporated  orthogonally  into  the  methods  presented  in  this  thesis. 

1.3  Main  Contributions 

The  main  contributions  of  this  thesis  are: 

• Modeling (Chapter 3). I present two novel classes of loopy conditional models:
dynamic conditional random fields (DCRFs) and skip-chain CRFs. A common
theme of these models is to show that adding long-distance dependencies
between words can improve performance in sequence labeling tasks. These models
also motivate the need for more efficient training methods for CRFs.
• Piecewise Training (Chapter 4). I introduce piecewise training, a method for
training regions of a factor graph separately and combining them at test time.
This method has appeared as a heuristic in scattered places in the literature,
but has never been studied systematically. On several benchmark NLP tasks, I
show that the factor-as-piece approximation performs surprisingly well, always
exceeding pseudolikelihood and sometimes rivaling exact maximum likelihood.
In addition, piecewise training can be understood as a generalization of the
pseudo-moment matching estimator of Wainwright et al. [142] that allows for
conditional models with arbitrary parameterization. This provides a satisfying
theoretical explanation of the positive experimental results of the factor-as-piece
approximation.

• Piecewise Pseudolikelihood (Chapter 5). Piecewise training scales poorly
in computation time to variables or factors that have large cardinality.
Pseudolikelihood scales much better in computation time, but has poor accuracy
in the NLP data that I consider. Therefore I propose a hybrid method called
piecewise pseudolikelihood (PWPL), which “pseudolikelihoodizes” (in a sense
that can be stated formally) the individual terms in the piecewise likelihood. I
show that under certain conditions, PWPL converges to the piecewise solution
in the limit of infinite data. In terms of accuracy, PWPL performs more like
piecewise training than like pseudolikelihood, making it a practical alternative
to pseudolikelihood for the problems considered here.

• Improved Dynamic Schedules for Belief Propagation (Section 6.1). In
Chapter 4, I show how piecewise training can be viewed as a type of early
stopping of belief propagation. But the running time of BP and even its convergence
depend greatly on the schedule used to send the messages. Dynamic update
schedules have recently been shown to converge much faster on hard networks
than static schedules by Elidan et al. [30], who propose a simple and effective
schedule which they call residual BP (RBP). But the RBP algorithm wastes message
updates: many updates are computed solely to determine their priority, and
are never actually performed. I show that estimating the residual, rather than
calculating it directly, leads to significant decreases in the number of messages
required for convergence, and in the total running time. The residual is
estimated using an upper bound based on recent work on message errors in BP. On
both synthetic and real-world networks, this dramatically decreases the running
time of BP, in some cases by a factor of five, without affecting the quality of
the solution.

•  Integrating  Inference  and  Learning  (Section  6.2).  Piecewise  training  can 
be  seen  as  a  sort  of  early  stopping  of  BP,  one  in  which  BP  is  stopped  before 
sending  any  messages.  Therefore  it  is  natural  to  ask  whether  one  can  do  better 
by  using  some  less  drastic  form  of  early  stopping.  I  propose  a  view  of  such 
methods  as  attempting  to  find  fixed  points  of  a  single  system  of  equations, 
which  includes  both  gradient  updates  on  the  model  parameters  and  BP  message 
updates.  Update  schedules  for  solving  this  system  can  be  seen  as  integrating 
inference  and  learning,  because  they  have  the  freedom  to  dynamically  choose 
when  to  make  parameter  updates  and  when  to  send  messages. 

1.4  Declaration  of  Previous  Work 

Most  of  the  work  of  this  thesis  has  been  previously  published.  The  rest  has  been
collected  into  several  technical  reports.  These  are: 

•  Much  of  the  tutorial  information  in  Chapter  2  has  been  published  in  Sutton 
and  McCallum  [121]. 
•  The  work  on  DCRFs  in  Section  3.1  has  appeared  as  Sutton,  McCallum,  and
Rohanimanesh  [126],  which  was  based  on  two  earlier  conference  papers  [120,
125].

•  The  work  on  skip-chain  CRFs  in  Section  3.2  appeared  as  Sutton  and  McCallum 
[118]. 

•  The  initial  work  on  piecewise  training  was  Sutton  and  McCallum  [119]. 

•  The  work  on  shared-unary  piecewise  and  one-step  cutout  (Sections  4.4  and  4.5)
was  done  at  Microsoft  Research  in  collaboration  with  Tom  Minka,  and  appears
in  an  MSR  tech  report  [124].

•  Chapter  5  on  piecewise  pseudolikelihood  was  originally  published  as  Sutton  and
McCallum  [122].

•  The  work  on  zero-lookahead  RBPO  in  Section  6.1  has  been  published  as  Sutton
and  McCallum  [123]. 
CHAPTER  2


BACKGROUND 


This  chapter  provides  the  statistical  and  algorithmic  background  necessary  to 
understand  the  current  work.  I  review  necessary  concepts  in  graphical  models  and 
inference  (Section  2.1),  and  then  explain  conditional  random  fields,  starting  with  the 
linear-chain  case  (Section  2.3)  and  then  describing  CRFs  in  general  (Section  2.4). 

2.1  Graphical  Models 

Graphical modeling is a powerful framework for representation and inference in
multivariate probability distributions. It has proven useful in diverse areas of stochastic
modeling, including coding theory [76], computer vision [38], knowledge representation
[91], Bayesian statistics [37], and natural-language processing [11, 58].

Distributions over many variables can be very expensive to represent naively. For
example, a table of joint probabilities of $n$ binary variables requires storing $O(2^n)$
floating-point numbers. The insight of the graphical modeling perspective is that
even when a distribution is defined over a large set of variables, it can often be
represented as a product of local functions that each depend on a much smaller subset of
variables. This factorization turns out to have a close connection to certain
conditional independence relationships among the variables, with both types of information
being easily summarized by a graph. Indeed, this relationship between factorization,
conditional independence, and graph structure comprises much of the power of the
graphical modeling framework: the conditional independence viewpoint is most useful
for designing models, and the factorization viewpoint is most useful for designing
inference algorithms.
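The size of the saving is easy to quantify. The sketch below compares the number of entries in a full joint table with the number stored by a factorized representation; the chain of pairwise factors and the helper names are illustrative assumptions rather than a model from this thesis.

```python
def full_table_size(n, k=2):
    """Entries in a joint probability table over n variables with k outcomes each."""
    return k ** n

def factorized_size(scopes, k=2):
    """Total entries when each factor stores one value per assignment to its scope."""
    return sum(k ** len(scope) for scope in scopes)

# A chain over n binary variables with one pairwise factor per adjacent pair.
n = 30
chain_scopes = [(s, s + 1) for s in range(n - 1)]
```

For $n = 30$ the full table has $2^{30}$ (about $10^9$) entries, while the chain factorization stores $29 \times 4 = 116$ numbers.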

In  the  rest  of  this  section,  I  introduce  graphical  models  from  both  the  factorization 
and  conditional  independence  viewpoints,  focusing  on  those  models  which  are  based 
on  undirected  graphs,  because  such  models  are  the  principal  topic  of  this  thesis.  I  also 
discuss  a  few  approximate  inference  algorithms  that  are  useful  in  the  present  work. 

2.1.1  Undirected  Graphical  Models 

We consider probability distributions over sets of random variables $V = X \cup Y$,
where $X$ is a set of input variables that we assume are observed, and $Y$ is a set of
output variables that we wish to predict. Every variable $s \in V$ takes outcomes from
a set $\mathcal{V}$, which can be either continuous or discrete, although I consider only the
discrete case in this thesis. An arbitrary assignment to $X$ is denoted by a vector $x$.
Given a variable $s \in X$, the notation $x_s$ denotes the value assigned to $s$ by $x$, and
similarly for an assignment to a subset $a \subset X$ by $x_a$. The notation $1_{\{x = x'\}}$ denotes
an indicator function of $x$ which takes the value 1 when $x = x'$ and 0 otherwise. We
also require a special notation for marginalization. For a fixed variable assignment $y_s$,
we use the summation $\sum_{y \setminus y_s}$ to indicate a summation over all possible assignments
$y$ whose value for variable $s$ is equal to $y_s$.

Suppose that we have reason to believe that a probability distribution $p$ of interest
can be represented by a product of factors of the form $\Psi_a(x_a, y_a)$, where each factor
has scope $a \subset V$. This factorization can allow us to represent $p$ much more efficiently,
because the sets $a$ may be much smaller than the full variable set $V$. We assume
without loss of generality that each distinct set $a$ has at most one factor $\Psi_a$, so that

An undirected graphical model is a family of probability distributions that
factorize according to a given collection of scopes. Formally, given a collection of subsets
$\mathcal{F} = \{a \subset V\}$, an undirected graphical model is defined as the set of all distributions
that can be written in the form

$$p(x, y) = \frac{1}{Z} \prod_{a \in \mathcal{F}} \Psi_a(x_a, y_a), \tag{2.1}$$

for any choice of local functions $F = \{\Psi_a\}$, where $\Psi_a : \mathcal{V}^{|a|} \to \mathbb{R}^+$. (These functions
are also called factors or compatibility functions.) We will occasionally use the term
random field to refer to a particular distribution among those defined by an undirected
model. The reason for the term graphical model will become apparent shortly, when
we discuss how the factorization of (2.1) can be represented as a graph.

The constant $Z$ is a normalization factor that ensures the distribution $p$ sums to
1. It is defined as

$$Z = \sum_{x, y} \prod_{a \in \mathcal{F}} \Psi_a(x_a, y_a). \tag{2.2}$$

The quantity $Z$, considered as a function of the set $F$ of factors, is called the partition
function in the statistical physics and graphical models communities. Computing $Z$
is intractable in general, but much work exists on how to approximate it.
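A direct implementation of (2.1) and (2.2) shows where the cost comes from. This is a minimal sketch that assumes discrete variables and factors stored as lookup tables; the variable names and the `agree` factor are made up for illustration. The loop in `partition_function` visits every joint assignment, so its running time grows exponentially with the number of variables.

```python
import itertools

def unnormalized(factors, assignment):
    """Product of local functions; each factor is a (scope, table) pair."""
    value = 1.0
    for scope, table in factors:
        value *= table[tuple(assignment[v] for v in scope)]
    return value

def partition_function(factors, variables, outcomes=(0, 1)):
    """Z: sum of the factor product over every joint assignment (exponential cost)."""
    z = 0.0
    for values in itertools.product(outcomes, repeat=len(variables)):
        z += unnormalized(factors, dict(zip(variables, values)))
    return z

def probability(factors, variables, assignment, outcomes=(0, 1)):
    """p(assignment) = (1/Z) * product of factors."""
    return (unnormalized(factors, assignment)
            / partition_function(factors, variables, outcomes))

# Hypothetical model: two pairwise factors that each favor agreement.
agree = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
variables = ["x1", "y1", "y2"]
factors = [(("x1", "y1"), agree), (("y1", "y2"), agree)]
```

Here the sum can be done by hand: for each value of `y1`, the two factors contribute $3 \times 3 = 9$, so $Z = 18$.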

We will generally assume further that each local function has the form

$$\Psi_a(x_a, y_a) = \exp\Big\{ \sum_k \theta_{ak} f_{ak}(x_a, y_a) \Big\}, \tag{2.3}$$

for some real-valued parameter vector $\theta_a$, and for some set of feature functions or
sufficient statistics $\{f_{ak}\}$. If $x$ and $y$ are discrete, then this is no restriction, because
we can let the features be indicator functions for every possible value, that is,
we include one feature function $f_{ak}(x_a, y_a) = 1_{\{x_a = x_a^*\}} 1_{\{y_a = y_a^*\}}$ for every
possible value $x_a^*$ and $y_a^*$.
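The indicator-feature argument can be verified on a small table. In this sketch, which assumes strictly positive factor values so that the logarithm is defined, each assignment gets one indicator feature whose weight is the log of the corresponding table entry; the table itself is made up.

```python
import math

def to_log_linear(table):
    """One indicator feature per assignment, with weight log(table value)."""
    return {assignment: math.log(value) for assignment, value in table.items()}

def evaluate(theta, assignment):
    """exp{sum_k theta_k f_k(assignment)}, where f_k is the indicator of key k."""
    return math.exp(sum(weight * (1.0 if key == assignment else 0.0)
                        for key, weight in theta.items()))

# An arbitrary positive factor over two binary variables (illustrative values).
table = {(0, 0): 0.5, (0, 1): 3.0, (1, 0): 1.0, (1, 1): 7.25}
theta = to_log_linear(table)
```

Exactly one indicator fires for any assignment, so the exponentiated sum recovers the original table entry.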

Also, a consequence of this parameterization is that the family of distributions
over $V$ parameterized by $\theta$ is an exponential family. Indeed, much of the discussion
of maximum-likelihood parameter estimation in this chapter applies to exponential
families in general.
Figure  2.1.  A  Markov  network  with  an  ambiguous  factorization.  Both  of  the  factor
graphs  at  right  factorize  according  to  the  Markov  network  at  left. 

As we have mentioned, there is a close connection between the factorization of
a graphical model and the conditional independencies among the variables in its
domain. This connection can be understood by means of an undirected graph known
as a Markov network, which directly represents conditional independence relationships
in a multivariate distribution. Let $G$ be an undirected graph with variables $V$, that
is, $G$ has one node for every random variable of interest. For a variable $s \in V$, let
$N(s)$ denote the neighbors of $s$. Then we say that a distribution $p$ is Markov with
respect to $G$ if it meets the local Markov property: for any two distinct variables
$s, t \in V$ with $t \notin N(s)$, the variable $s$ is independent of $t$ conditioned on its
neighbors $N(s)$. Intuitively, this means that the neighbors of $s$ contain all of the
information necessary to predict its value.

Given  a  factorization  of  a  distribution  p  as  in  (2.1),  an  equivalent  Markov  network 
can  be  constructed  by  connecting  all  pairs  of  variables  that  share  a  local  function. 
It  is  straightforward  to  show  that  p  is  Markov  with  respect  to  this  graph,  because 
the conditional distribution $p(x_s \mid x_{N(s)})$ that follows from (2.1) is a function only of
variables  that  appear  in  the  Markov  blanket. 

In  other  words,  if  p  factorizes  according  to  G,  then  p  is  Markov  with  respect  to 
G.  The  converse  direction  also  holds,  as  long  as  p  is  strictly  positive.  This  is  stated 
in the following classical result [7, 45]:

Theorem 2.1 (Hammersley-Clifford). Suppose p is a strictly positive distribution,
and G is an undirected graph that indexes the domain of p. Then p is Markov with
respect to G if and only if p factorizes according to G.

A Markov network has an undesirable ambiguity from the factorization perspective,
however. Consider the three-node Markov network in Figure 2.1 (left). Any
distribution that factorizes as $p(x_1, x_2, x_3) \propto a(x_1, x_2, x_3)$ for some positive function
a is Markov with respect to this graph. However, we may wish to use a more restricted
parameterization, where $p(x_1, x_2, x_3) \propto a(x_1, x_2)\, b(x_2, x_3)\, c(x_1, x_3)$. This second
parameterization denotes a smaller set of models, which therefore may be more
amenable to parameter estimation. But the Markov network formalism cannot distinguish
between these two parameterizations. In order to state models more precisely,
the factorization (2.1) can be represented directly by means of a factor graph [54].
A factor graph is a bipartite graph G = (V, F, E) in which a variable node $v_s \in V$
is connected to a factor node $\Psi_a \in F$ if $v_s$ is an argument to $\Psi_a$. An example of a
factor graph is shown graphically in Figure 2.2 (right). In that figure, the circles are
variable nodes, and the shaded boxes are factor nodes.

2.1.2  Directed  Graphical  Models 

Whereas the local functions in an undirected model need not have a direct probabilistic
interpretation, a directed graphical model describes how a distribution factorizes
into local conditional probability distributions. Let G = (V, E) be a directed
acyclic graph. A directed graphical model is a family of distributions that factorize
as:

$$p(y, x) = \prod_{v \in V} p(v \mid \pi(v)), \tag{2.4}$$

where $\pi(v)$ are the parents of v in G. An example of a directed model is shown in
Figure 2.2 (left). It can be shown by structural induction on G that p is properly
normalized. Directed models are often used as generative models, as we explain in
Section 2.2.3.

2.1.3  Inference 

The fundamental problem in graphical models is inference, that is, given a specification
of the joint distribution p(y) over all the variables, to compute the resulting
marginal distribution over subsets of Y. (In order to simplify notation, we have omitted
the variables X in this section.) There are two inference problems of interest in
this thesis: first, computing the marginal distributions of single variables

$$p(y_s) = \sum_{y \setminus y_s} p(y), \tag{2.5}$$

and second, that of computing max-marginals

$$\delta(y_s) = \max_{y \setminus y_s} p(y). \tag{2.6}$$

We will also be interested in the problem of computing marginals over factors, that
is, $p(y_a) = \sum_{y \setminus y_a} p(y)$, where the set $a \subset V$ is the scope of some factor.

These  two  inference  problems  can  be  seen  as  fundamentally  the  same  operation 
on two different semirings [2], that is, to change the marginal problem to the max-marginal
problem, we simply substitute max for plus. Indeed, many algorithms for
computing marginals have analogous procedures for computing max-marginals. Although
for discrete variables the marginals can be computed by brute-force summation,
the time required to do this is exponential in the size of Y. Indeed, both inference
problems  are  intractable  for  general  graphs,  because  any  propositional  satisfiability 
problem  can  be  easily  represented  as  a  factor  graph. 
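
As a minimal sketch of the semiring view, the following Python fragment computes both quantities by brute force on a hypothetical three-variable chain. The factor tables and the helper name `eliminate` are illustrative assumptions, not part of this thesis. The only difference between the two calls is whether the terms are combined with `sum` or with `max`.

```python
from itertools import product

# Hypothetical toy model: binary variables on a chain y0 - y1 - y2,
# with two pairwise factors stored as tables of positive values.
factors = [
    ((0, 1), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
    ((1, 2), {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}),
]
NUM_VARS = 3


def unnormalized(y):
    """Product of all factor values for the full assignment y."""
    value = 1.0
    for scope, table in factors:
        value *= table[tuple(y[i] for i in scope)]
    return value


def eliminate(s, combine):
    """Brute-force elimination of every variable except s.

    combine plays the role of the semiring addition: sum gives the
    marginal (2.5), max gives the max-marginal (2.6).
    """
    all_y = list(product((0, 1), repeat=NUM_VARS))
    z = sum(unnormalized(y) for y in all_y)
    result = []
    for value in (0, 1):
        terms = [unnormalized(y) / z for y in all_y if y[s] == value]
        result.append(combine(terms))
    return result


marginal_y1 = eliminate(1, sum)
max_marginal_y1 = eliminate(1, max)
```

Both calls enumerate all 2^3 assignments, which is exactly the exponential cost that the rest of this section tries to avoid.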

Exact inference algorithms are known for general graphs. Although these algorithms
require exponential time in the worst case, they can still be efficient for many
graphs that occur in practice. The most popular such algorithm, the junction tree
algorithm, successively clusters variables until the graph becomes a tree. Once such
an equivalent tree has been constructed, its marginals can be computed using exact
inference algorithms that are specific to trees, one of which is described in the
next section. For certain complex graphs, the junction tree algorithm is forced to
make clusters which are very large, so that on such graphs the procedure requires
exponential time. For more details on exact inference, see Cowell et al. [24].

For this reason, an enormous amount of effort has been devoted to approximate
inference algorithms. Two classes of approximate inference algorithms have received
the most attention: Monte Carlo algorithms, which attempt to sample from the distribution
of interest; and variational algorithms, which convert the inference problem
into an optimization problem, which is then relaxed or approximated until it becomes
tractable. Generally, Monte Carlo algorithms are guaranteed to sample from the distribution
of interest given enough computation time, although it is usually impossible
in practice to know when that point has been reached. Variational algorithms, on
the other hand, can be faster, but they tend to be biased, by which I mean that they
tend to have a source of error that is inherent to the approximation, and cannot be
easily lessened by giving them more computation time.

Despite this, I focus on variational algorithms in this thesis, for two reasons.
First,  parameter  estimation  requires  performing  inference  many  times,  and  so  a  fast 
inference  procedure  is  vital  to  efficient  training.  Second,  a  natural  connection  exists 
between  variational  inference  algorithms  and  performing  parameter  estimation  on 
subgraphs,  which  is  one  of  the  central  ideas  in  this  thesis  (Chapter  4). 

2.1.4  Belief  Propagation 

An important variational inference algorithm is belief propagation (BP), which
I explain in this section. I choose to explain BP in detail for two reasons: First, it is a
direct generalization of the exact inference algorithms for linear-chain CRFs. Second,
and more important for the purposes of this thesis, it has a close connection to the
piecewise methods that I introduce in Chapter 4.

Suppose that G is a tree, and we wish to compute the marginal distribution of a
variable s. The intuition behind BP is that each of the neighboring factors of s makes
a multiplicative contribution to the marginal of s, called a message, and each of these
messages can be computed separately because the graph is a tree. More formally, for
every factor $a \in N(s)$, call $V_a$ the set of variables that are “upstream” of a, that is,
the set of variables v for which a is between s and v. In a similar fashion, call $F_a$
the set of factors that are upstream of a, including a itself. But now because G is a
tree, the sets $\{V_a\} \cup \{s\}$ form a partition of the variables in G. This means that we
can split up the summation required for the marginal into a product of independent
subproblems as:

\begin{align}
p(y_s) &\propto \sum_{y \setminus y_s} \prod_a \Psi_a(y_a) \tag{2.7} \\
&= \prod_{a \in N(s)} \sum_{y_{V_a}} \prod_{\Psi_b \in F_a} \Psi_b(y_b). \tag{2.8}
\end{align}

Denote each factor in the above equation by $m_{as}$, that is,

$$m_{as}(y_s) = \sum_{y_{V_a}} \prod_{\Psi_b \in F_a} \Psi_b(y_b). \tag{2.9}$$

The quantity $m_{as}$ can be thought of as a message from the factor a to the variable s that summarizes
the impact of the network upstream of a on the belief in s. In a similar fashion, we
can define messages from variables to factors as

$$m_{sa}(y_s) = \sum_{y_{V_s}} \prod_{\Psi_b \in F_s} \Psi_b(y_b). \tag{2.10}$$

Then, from (2.8), we have that the marginal $p(y_s)$ is proportional to the product of
all the incoming messages to variable s. Similarly, factor marginals can be computed
as

$$p(y_a) \propto \Psi_a(y_a) \prod_{s \in a} m_{sa}(y_s). \tag{2.11}$$

Here I have treated a as a set of variables denoting the scope of factor $\Psi_a$. I will do
this throughout this thesis. In addition, I will sometimes use the reverse notation
$c \ni s$ to mean the set of all factors c that contain the variable s.

Naively computing the messages according to (2.9) is impractical, because the
messages as we have defined them require summation over possibly many variables
in the graph. Fortunately, the messages can also be written using a recursion that
requires only local summation. The recursion is

\begin{align}
m_{as}(y_s) &= \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in a \setminus s} m_{ta}(y_t) \notag \\
m_{sa}(y_s) &= \prod_{b \in N(s) \setminus a} m_{bs}(y_s) \tag{2.12}
\end{align}

That this recursion matches the explicit definition of m can be seen by repeated
substitution, and proven by induction. In a tree, it is possible to schedule these
recursions such that the antecedent messages are always sent before their dependents,
by first sending messages from the leaves, and so on toward the root. This is the
algorithm known as belief propagation [91].
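
The recursion (2.12) can be sketched directly as two mutually recursive Python functions. This is only an illustration under stated assumptions: the four-variable tree, the factor values, and all function names are hypothetical, and the recursion recomputes messages rather than caching them, which a practical implementation would avoid. Because the graph is a tree, the recursion bottoms out at the leaves, where the empty product of incoming messages equals 1.

```python
from itertools import product
from math import prod

# Hypothetical tree-structured factor graph over binary variables 0..3:
# factor "a" touches (0, 1), "b" touches (1, 2), "c" touches (1, 3).
K = 2  # number of values per variable
factors = {
    "a": ((0, 1), lambda y0, y1: 2.0 if y0 == y1 else 1.0),
    "b": ((1, 2), lambda y1, y2: 3.0 if y1 == y2 else 1.0),
    "c": ((1, 3), lambda y1, y3: 1.0 + y1 + 2.0 * y3),
}
variables = [0, 1, 2, 3]


def neighbors(s):
    """Names of the factors whose scope contains variable s."""
    return [a for a, (scope, _) in factors.items() if s in scope]


def msg_factor_to_var(a, s):
    """m_as(y_s): sum out the other variables in the scope of factor a."""
    scope, psi = factors[a]
    others = [t for t in scope if t != s]
    incoming = {t: msg_var_to_factor(t, a) for t in others}
    out = [0.0] * K
    for y_a in product(range(K), repeat=len(scope)):
        assign = dict(zip(scope, y_a))
        term = psi(*y_a) * prod(incoming[t][assign[t]] for t in others)
        out[assign[s]] += term
    return out


def msg_var_to_factor(s, a):
    """m_sa(y_s): product of messages from all other neighboring factors."""
    incoming = [msg_factor_to_var(b, s) for b in neighbors(s) if b != a]
    return [prod(m[v] for m in incoming) for v in range(K)]


def bp_marginal(s):
    """Normalized product of all messages arriving at variable s."""
    incoming = [msg_factor_to_var(a, s) for a in neighbors(s)]
    belief = [prod(m[v] for m in incoming) for v in range(K)]
    total = sum(belief)
    return [b / total for b in belief]


def brute_force_marginal(s):
    """Reference answer obtained by summing the joint over all assignments."""
    belief = [0.0] * K
    for y in product(range(K), repeat=len(variables)):
        belief[y[s]] += prod(psi(*(y[t] for t in scope))
                             for scope, psi in factors.values())
    total = sum(belief)
    return [b / total for b in belief]
```

On this tree, `bp_marginal(s)` and `brute_force_marginal(s)` agree for every variable, while the message-based version only ever sums over the scope of a single factor.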

In addition to computing single-variable marginals, we will also wish to compute
factor marginals $p(y_a)$ and joint probabilities p(y) for a given assignment y. (Recall
that the latter problem is difficult because it requires computing the log partition
function log Z.) First, to compute marginals over factors (or over any connected set
of variables, in fact) we can use the same decomposition of the marginal as for the
single-variable case, and get

$$p(y_a) = \kappa \, \Psi_a(y_a) \prod_{s \in a} m_{sa}(y_s), \tag{2.13}$$

where $\kappa$ is a proportionality constant, computed to make the distribution sum to
1. In fact, a similar idea works for any connected set of variables, not just a set
that happens to be the domain of some factor, although if the set is too large, then
computing $\kappa$ is impractical.

This comment motivates the second problem, computing joint probabilities p(y).
Perhaps the most convenient way to do this is to use the fact that in a tree-structured
distribution

$$p(y) = \prod_{s \in V} p(y_s) \prod_a \frac{p(y_a)}{\prod_{t \in a} p(y_t)}. \tag{2.14}$$

This is because any tree can be represented as a junction tree with one cluster for
each factor. Using this identity, we can compute p(y) (or log Z) from the per-variable
and per-factor marginals.
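
The identity (2.14) is easy to check numerically. The sketch below uses a hypothetical three-variable chain whose factor tables are arbitrary illustrative values. It computes the exact marginals by brute force, rebuilds the joint from them, and then recovers log Z by comparing the rebuilt probability of one assignment with its unnormalized score.

```python
from itertools import product
from math import log, prod

# Hypothetical chain y0 - y1 - y2 over binary variables.
factors = [
    ((0, 1), [[2.0, 1.0], [1.0, 3.0]]),
    ((1, 2), [[1.0, 4.0], [2.0, 1.0]]),
]
states = list(product((0, 1), repeat=3))


def score(y):
    """Unnormalized probability: the product of the factor values."""
    return prod(table[y[i]][y[j]] for (i, j), table in factors)


Z = sum(score(y) for y in states)
joint = {y: score(y) / Z for y in states}


def marginal(scope):
    """Exact marginal over the variables in scope, by brute force."""
    out = {}
    for y, p in joint.items():
        key = tuple(y[i] for i in scope)
        out[key] = out.get(key, 0.0) + p
    return out


var_marg = {s: marginal((s,)) for s in range(3)}
fac_marg = {scope: marginal(scope) for scope, _ in factors}


def tree_joint(y):
    """Right-hand side of (2.14), built only from marginals."""
    value = prod(var_marg[s][(y[s],)] for s in range(3))
    for scope, _ in factors:
        y_a = tuple(y[i] for i in scope)
        value *= fac_marg[scope][y_a] / prod(var_marg[t][(y[t],)] for t in scope)
    return value


# log Z recovered from the marginals and a single unnormalized score.
y_ref = (0, 0, 0)
log_Z_from_marginals = log(score(y_ref)) - log(tree_joint(y_ref))
```

In practice the marginals would come from belief propagation rather than from enumeration; the enumeration here only serves as an independent check.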

The  preceding  discussion  assumes  that  the  graph  G  is  a  tree.  If  G  is  not  a  tree, 
the  message  updates  (2.12)  are  no  longer  guaranteed  to  return  the  exact  marginals, 
nor  are  they  guaranteed  even  to  converge,  but  we  can  still  iterate  them  in  an  attempt 
to  find  a  fixed  point.  This  procedure  is  called  loopy  belief  propagation.  To  emphasize 
the  approximate  nature  of  this  procedure,  I  refer  to  the  approximate  marginals  that 
result from loopy BP as beliefs rather than as marginals, and denote them by $q(y_s)$.
Surprisingly,  loopy  BP  can  be  seen  as  a  variational  method  as  follows.  The  general 
variational  idea  is  to: 


1. Define a family of tractable distributions Q and an objective function O(q).
The function O should be designed to measure how well a tractable distribution
$q \in Q$ approximates the distribution p of interest.

2. Find the “closest” tractable distribution $q^* = \arg\min_{q \in Q} O(q)$.

3. Use the marginals of $q^*$ to approximate those of p.

For example, suppose that we take Q to be the set of all possible distributions over y,
and

\begin{align}
O(q) &= \mathrm{KL}(q \,\|\, p) - \log Z \tag{2.15} \\
&= -H(q) - \sum_a \sum_{y_a} q(y_a) \log \Psi_a(y_a). \tag{2.16}
\end{align}

Then the solution to this variational problem is $q^* = p$ with optimal value $O(q^*) =
-\log Z$. Solving this particular variational formulation is thus equivalent to performing
exact inference. Approximate inference techniques can be devised by changing the
set Q (for example, by requiring q to be fully factorized) or by using a different
objective O.
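
A small numerical check makes the exact variational problem concrete. The sketch below evaluates the objective (2.16) on a hypothetical three-variable chain (the factor tables are illustrative assumptions) and confirms that the true distribution attains the value −log Z, while another candidate, here the uniform distribution, scores worse.

```python
from itertools import product
from math import log, prod

# Hypothetical chain y0 - y1 - y2 over binary variables.
factors = [
    ((0, 1), [[2.0, 1.0], [1.0, 3.0]]),
    ((1, 2), [[1.0, 4.0], [2.0, 1.0]]),
]
states = list(product((0, 1), repeat=3))


def score(y):
    """Unnormalized probability: the product of the factor values."""
    return prod(table[y[i]][y[j]] for (i, j), table in factors)


Z = sum(score(y) for y in states)
p = {y: score(y) / Z for y in states}
uniform = {y: 1.0 / len(states) for y in states}


def objective(q):
    """O(q) = -H(q) - sum_a E_q[log Psi_a(y_a)], as in (2.16).

    The expectation of the sum of log factors equals the expectation
    of log score(y) under the full distribution q.
    """
    neg_entropy = sum(q[y] * log(q[y]) for y in states if q[y] > 0)
    expected_log_factors = sum(q[y] * log(score(y)) for y in states)
    return neg_entropy - expected_log_factors


best = objective(p)         # equals -log Z
worse = objective(uniform)  # equals KL(uniform || p) - log Z
```

The gap `worse - best` is exactly the KL divergence from the uniform distribution to p, which is why minimizing O over all distributions recovers p itself.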

With that background on variational methods, let us see how belief propagation
can be understood in this framework. We make two approximations. First, we approximate
the entropy term H(q) of (2.16), which as it stands is difficult to compute.
If q were a tree-structured distribution, then its entropy could be written exactly as

$$H_{\text{Bethe}}(q) = -\sum_a \sum_{y_a} q(y_a) \log q(y_a) + \sum_i (d_i - 1) \sum_{y_i} q(y_i) \log q(y_i), \tag{2.17}$$

where $d_i$ is the degree of variable i, that is, the number of factors that contain it.
This follows from substituting the junction-tree formulation (2.14) of the joint into
the definition of entropy. If q is not a tree, then we can still take $H_{\text{Bethe}}$ as an
approximation to H in the variational objective O. This yields the
Bethe free energy:

$$O_{\text{Bethe}}(q) = -H_{\text{Bethe}}(q) - \sum_a \sum_{y_a} q(y_a) \log \Psi_a(y_a). \tag{2.18}$$

The objective $O_{\text{Bethe}}$ depends on q only through its marginals, so rather than optimizing
it over all probability distributions q, we can optimize over the space of all
marginal vectors. Specifically, every distribution q has an associated belief vector $\mathbf{q}$,
with elements $q_{a;y_a}$ for each factor a and assignment $y_a$, and elements $q_{i;y_i}$ for each
variable i and assignment $y_i$. The space of all possible belief vectors has been called
the marginal polytope [138]. However, for intractable models, the marginal polytope
can have extremely complex structure.

This leads us to the second variational approximation made by loopy BP, namely
that the objective $O_{\text{Bethe}}$ is optimized instead over a relaxation of the marginal polytope.
The relaxation is to require that the beliefs be only locally consistent, that is,
that

$$\sum_{y_a \setminus y_i} q_a(y_a) = q_i(y_i) \quad \forall a,\ i \in a. \tag{2.19}$$

Under these constraints, Yedidia et al. [149] show that constrained stationary points
of $O_{\text{Bethe}}$ are fixed points of loopy BP. So we can view the Bethe energy $O_{\text{Bethe}}$ as an
objective function that the loopy BP fixed-point operations attempt to optimize.
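
To illustrate loopy BP and the local consistency constraint (2.19), the following sketch iterates the updates (2.12) on a hypothetical three-variable cycle with pairwise factors. The factor tables, the synchronous schedule, the fixed number of iterations, and all function names are assumptions made for this example. After the messages settle, the factor beliefs marginalize to the variable beliefs, even though the beliefs themselves only approximate the exact marginals.

```python
from itertools import product
from math import prod

K = 2
# Hypothetical pairwise model on a 3-cycle: every factor touches two variables.
factors = {
    "a": ((0, 1), [[2.0, 1.0], [1.0, 3.0]]),
    "b": ((1, 2), [[1.0, 2.0], [2.0, 1.0]]),
    "c": ((2, 0), [[3.0, 1.0], [1.0, 1.0]]),
}


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


def psi(a, s, ys, yt):
    """Factor a evaluated with variable s at ys and its other variable at yt."""
    (i, _), table = factors[a]
    return table[ys][yt] if s == i else table[yt][ys]


def other(a, s):
    i, j = factors[a][0]
    return j if s == i else i


def var_to_factor(msgs, t, a):
    """m_ta: product of messages into t from every factor except a."""
    return [prod(msgs[(b, t)][v] for b, (scope, _) in factors.items()
                 if t in scope and b != a) for v in range(K)]


# Factor-to-variable messages m[(a, s)], initialized to uniform.
m = {(a, s): [1.0 / K] * K for a, (scope, _) in factors.items() for s in scope}
for _ in range(100):  # synchronous updates; no convergence guarantee in general
    new = {}
    for (a, s) in m:
        m_ta = var_to_factor(m, other(a, s), a)
        new[(a, s)] = normalize([sum(psi(a, s, ys, yt) * m_ta[yt] for yt in range(K))
                                 for ys in range(K)])
    m = new


def var_belief(s):
    return normalize([prod(m[(a, s)][v] for a, (scope, _) in factors.items()
                           if s in scope) for v in range(K)])


def factor_belief(a):
    """q_a(y_i, y_j), proportional to Psi_a times both variable-to-factor messages."""
    (i, j), table = factors[a]
    m_ia, m_ja = var_to_factor(m, i, a), var_to_factor(m, j, a)
    raw = [[table[yi][yj] * m_ia[yi] * m_ja[yj] for yj in range(K)] for yi in range(K)]
    total = sum(map(sum, raw))
    return [[x / total for x in row] for row in raw]


def exact_marginal(s):
    """Brute-force marginal, for comparison with the belief var_belief(s)."""
    out = [0.0] * K
    for y in product(range(K), repeat=3):
        out[y[s]] += prod(table[y[i]][y[j]] for (i, j), table in factors.values())
    return normalize(out)
```

Comparing `var_belief(s)` with `exact_marginal(s)` shows the approximation error introduced by the loop, whereas summing out one argument of `factor_belief(a)` reproduces the corresponding variable belief.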

A second way of formulating BP as a variational algorithm yields a dual form of
the Bethe energy that will prove particularly useful [81]. This dual energy arises from
the expectation propagation view of BP [77]. Suppose we have a set of BP messages
$\{m_{ai}\}$, which have not necessarily converged. Then we view the outgoing messages
from each factor as approximating it, that is,

$$\Psi_a(y_a) \approx t_a(y_a) = \prod_{i \in a} m_{ai}(y_i). \tag{2.20}$$

This yields an approximation to the distribution p, namely, $p(y) \approx q(y) = \prod_a t_a(y_a)$.
Observe that q is possibly unnormalized. As Minka [77] observes, each message update
of loopy BP can be viewed as refining one of the terms $t_a$ so that q is closer to p, in terms
of KL divergence.

Since q was therefore chosen to approximate p, it makes sense to use the mass of
q to approximate the mass of p. More precisely, let p' be the unnormalized version of
p, that is, $p'(y) = \prod_a \Psi_a(y_a)$. Then define rescaled versions of $t_a$ and q as

\begin{align}
\tilde{t}_a(y_a) &= s_a t_a(y_a) \tag{2.21} \\
\tilde{q}(y) &= \prod_a \tilde{t}_a(y_a). \tag{2.22}
\end{align}

Then the idea is to scale each of the $t_a$ so that the resulting $\sum_y \tilde{q}(y)$ matches as
closely as possible the partition function $\sum_y p'(y)$. This can be done by optimizing
local divergences in an analogous manner to EP. Define $\tilde{q}^{\setminus a}$ as the approximating $\tilde{q}$
without the factor $\tilde{t}_a$, that is, $\tilde{q}^{\setminus a}(y) = \prod_{b \neq a} \tilde{t}_b(y_b)$. Each $s_a$ is separately chosen to optimize

$$\min_{s_a} \mathrm{KL}\big( \Psi_a(y_a)\, \tilde{q}^{\setminus a}(y) \,\big\|\, s_a t_a(y_a)\, \tilde{q}^{\setminus a}(y) \big). \tag{2.23}$$


Observe that because $\tilde{q}^{\setminus a}$ depends on the scale factors $s_b$ of the other factors b, the
local objective function depends on all of the other scale factors as well. The optimal
$s_a$ is given by

$$s_a = \frac{\sum_y \frac{\Psi_a(y_a)}{t_a(y_a)}\, q(y)}{\sum_y q(y)}. \tag{2.24}$$

Thus the optimal $s_a$ actually does not depend on the other scale values. Now taking
the sum $\sum_y \tilde{q}(y)$ yields the following approximation to the partition function:

$$Z_{\text{BetheDual}} = \sum_y \tilde{q}(y) = \Big( \sum_y q(y) \Big) \prod_a \frac{\sum_y \frac{\Psi_a(y_a)}{t_a(y_a)}\, q(y)}{\sum_y q(y)}. \tag{2.25}$$

It can be shown [78] that this is also a free energy for BP, that is, that fixed points
of BP are stationary points of this objective.

Another view of BP is the reparameterization viewpoint [141]. In this view, the
BP updates are expressed solely in terms of the beliefs $b^n$ at each iteration n of the
algorithm. The beliefs are initialized as $b_a^0 \propto \Psi_a$ for all factors, and $b_s^0 \propto 1$ for all
variables. The updates at iteration n are

\begin{align}
b_s^n(y_s) &= \sum_{y_{N(s)}} B_s^{n-1}(y_s, y_{N(s)}) \notag \\
b_a^n(y_a) &= \sum_{y_{N(a)}} B_a^{n-1}(y_a, y_{N(a)}), \tag{2.26}
\end{align}

where the distributions $B_s^{n-1}$ and $B_a^{n-1}$ are defined as

\begin{align}
B_s^{n-1}(y_s, y_{N(s)}) &\propto b_s^{n-1}(y_s) \prod_{a \ni s} \frac{b_a^{n-1}(y_a)}{b_s^{n-1}(y_s)} \notag \\
B_a^{n-1}(y_a, y_{N(a)}) &\propto b_a^{n-1}(y_a) \prod_{s \in a} \prod_{c \ni s \setminus a} \frac{b_c^{n-1}(y_c)}{b_s^{n-1}(y_s)}. \tag{2.27}
\end{align}

(This notation is adapted from Rosen-Zvi et al. [103].)

To see how this corresponds to the message-based recursions (2.12), consider the
following message passing schedule. At each iteration n, first compute all of the
to-variable messages simultaneously as

$$m_{as}^n(y_s) = \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in a \setminus s} m_{ta}^{n-1}(y_t), \tag{2.28}$$

then compute all of the to-factor messages as

$$m_{sa}^n(y_s) = \prod_{c \ni s \setminus a} m_{cs}^n(y_s). \tag{2.29}$$

Then define the beliefs as usual: $q_s^n \propto \prod_{a \ni s} m_{as}^n$ for variables s, and $q_a^n \propto
\Psi_a \prod_{s \in a} \prod_{c \ni s \setminus a} m_{cs}^n$ for all factors a. In order to obtain a clean correspondence between
these updates and the reparameterization updates, we also need the assumption that the graphs
defined by $B_s^{n-1}$ and $B_a^{n-1}$ are trees, which always happens if all factors are pairwise.

Now we can state the correspondence between the message-based and reparameterization
viewpoints: The beliefs from the message-based recursions are equal to
those from the reparameterization-based recursions. That is, for all iterations n,
$q_s^n = b_s^n$ and $q_a^n = b_a^n$. This can be seen by induction, substituting the messages
corresponding to $q^{n-1}$ into the update equations (2.26) for $b^n$.

The reason for the term “reparameterization” is that at each iteration n, we can
construct a distribution $T^n(y)$ over the full space with factors

$$T_s^n = b_s^n, \qquad T_a^n = \frac{b_a^n}{\prod_{s \in a} b_s^n}. \tag{2.30}$$

This distribution is invariant under the message update, that is, $T^n = T^{n-1} = \cdots =
T^0 = p$. So each $T^n$ can be viewed as a reparameterization of the original distribution.
This view of BP will prove especially useful in Section 4.1.3.


2.2  Applications  of  Graphical  Models 

In this section we discuss several applications of graphical models to natural language
processing. Although these examples are well-known, they serve both to clarify
the  definitions  in  the  previous  section,  and  to  illustrate  some  ideas  that  will  arise 
again  in  our  discussion  of  conditional  random  fields.  We  devote  special  attention  to 
the  hidden  Markov  model  (HMM),  because  it  is  closely  related  to  the  linear-chain 
CRF. 

2.2.1  Classification 

First we discuss the problem of classification, that is, predicting a single class variable
y given a vector of features $x = (x_1, x_2, \ldots, x_K)$. One simple way to accomplish
this is to assume that once the class label is known, all the features are independent.

Figure 2.2. The naive Bayes classifier, as a directed model (left), and as a factor
graph (right).

The resulting classifier is called the naive Bayes classifier. It is based on a joint
probability model of the form:

$$p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y). \tag{2.31}$$

This model can be described by the directed model shown in Figure 2.2 (left). We can
also write this model as a factor graph, by defining a factor $\Psi(y) = p(y)$, and a factor
$\Psi_k(y, x_k) = p(x_k \mid y)$ for each feature $x_k$. This factor graph is shown in Figure 2.2
(right).

Another well-known classifier that is naturally represented as a graphical model is
logistic regression (sometimes known as the maximum entropy classifier in the NLP
community). In statistics, this classifier is motivated by the assumption that the log
probability, $\log p(y \mid x)$, of each class is a linear function of x, plus a normalization
constant. This leads to the conditional distribution:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j \Big\}, \tag{2.32}$$

where $Z(x) = \sum_y \exp\{\lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j\}$ is a normalizing constant, and $\lambda_y$ is a bias
weight that acts like $\log p(y)$ in naive Bayes. Rather than using one weight vector
per class, as in (2.32), we can use a different notation in which a single set of weights
is shared across all the classes. The trick is to define a set of feature functions that
are nonzero only for a single class. To do this, the feature functions can be defined
as $f_{y',j}(y, x) = \mathbf{1}_{\{y' = y\}} x_j$ for the feature weights and $f_{y'}(y, x) = \mathbf{1}_{\{y' = y\}}$ for the bias
weights. Now we can use $f_k$ to index each feature function $f_{y',j}$, and $\lambda_k$ to index its
corresponding weight $\lambda_{y',j}$. Using this notational trick, the logistic regression model
becomes:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_k \lambda_k f_k(y, x) \Big\}. \tag{2.33}$$

We introduce this notation because it mirrors the usual notation for conditional random
fields.
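
The notational trick can be checked with a short sketch. The class labels, weights, and input vector below are hypothetical values chosen only for illustration. The first function implements the per-class form (2.32); the second builds the flat list of class-specific feature functions and implements the shared-weight form (2.33).

```python
from math import exp

classes = ["PER", "LOC"]
x = [1.0, 0.0, 2.0]  # hypothetical feature vector

# Per-class parameterization (2.32): one bias and one weight vector per class.
bias = {"PER": 0.5, "LOC": -0.2}
weights = {"PER": [1.0, -1.0, 0.3], "LOC": [0.2, 0.4, -0.5]}


def p_per_class(y, x):
    scores = {c: exp(bias[c] + sum(w * xj for w, xj in zip(weights[c], x)))
              for c in classes}
    return scores[y] / sum(scores.values())


# Shared parameterization (2.33): each feature function is nonzero for one class only.
features, lambdas = [], []
for c in classes:
    # Bias feature f_{y'}(y, x) = 1{y' = y}.
    features.append(lambda y, x, c=c: 1.0 if y == c else 0.0)
    lambdas.append(bias[c])
    for j in range(len(x)):
        # Input feature f_{y',j}(y, x) = 1{y' = y} x_j.
        features.append(lambda y, x, c=c, j=j: x[j] if y == c else 0.0)
        lambdas.append(weights[c][j])


def p_shared(y, x):
    scores = {c: exp(sum(lam * f(c, x) for lam, f in zip(lambdas, features)))
              for c in classes}
    return scores[y] / sum(scores.values())
```

Both functions return the same probabilities for every class; only the bookkeeping of the weights differs.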

2.2.2  Sequence  Models 

Classifiers  predict  only  a  single  class  variable,  but  the  true  power  of  graphical 
models  lies  in  their  ability  to  model  many  variables  that  are  interdependent.  In  this 
section,  we  discuss  perhaps  the  simplest  form  of  dependency,  in  which  the  output 
variables  are  arranged  in  a  sequence.  To  motivate  this  kind  of  model,  we  discuss  an 
application  from  natural  language  processing,  the  task  of  named-entity  recognition 
(NER).  NER  is  the  problem  of  identifying  and  classifying  proper  names  in  text, 
including locations, such as China; people, such as George Bush; and organizations,
such  as  the  United  Nations.  The  named-entity  recognition  task  is,  given  a  sentence, 
first  to  segment  which  words  are  part  of  entities,  and  then  to  classify  each  entity  by 
type  (person,  organization,  location,  and  so  on).  The  challenge  of  this  problem  is 
that  many  named  entities  are  too  rare  to  appear  even  in  a  large  training  set,  and 
therefore  the  system  must  identify  them  based  only  on  context. 

One approach to NER is to classify each word independently as one of either Person,
Location, Organization, or Other (meaning not an entity). The problem
with this approach is that it assumes that given the input, all of the named-entity
labels are independent. In fact, the named-entity labels of neighboring words are
dependent; for example, while New York is a location, New York Times is an organization.
This independence assumption can be relaxed by arranging the output
variables in a linear chain. This is the approach taken by the hidden Markov model
(HMM) [98]. An HMM models a sequence of observations $X = \{x_t\}_{t=1}^{T}$ by assuming
that there is an underlying sequence of states $Y = \{y_t\}_{t=1}^{T}$ drawn from a finite state
set S. In the named-entity example, each observation $x_t$ is the identity of the word
at position t, and each state $y_t$ is the named-entity label, that is, one of the entity
types Person, Location, Organization, and Other.

To model the joint distribution p(y, x) tractably, an HMM makes two independence
assumptions. First, it assumes that each state depends only on its immediate
predecessor, that is, each state $y_t$ is independent of all its ancestors $y_1, y_2, \ldots, y_{t-2}$
given the preceding state $y_{t-1}$. Second, an HMM assumes that each observation variable
$x_t$ depends only on the current state $y_t$. With these assumptions, we can specify
an HMM using three probability distributions: first, the distribution $p(y_1)$ over initial
states; second, the transition distribution $p(y_t \mid y_{t-1})$; and finally, the observation
distribution $p(x_t \mid y_t)$. That is, the joint probability of a state sequence y and an
observation sequence x factorizes as

$$p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t), \tag{2.34}$$

where, to simplify notation, we write the initial state distribution $p(y_1)$ as $p(y_1 \mid y_0)$. In
natural language processing, HMMs have been used for sequence labeling tasks such
as part-of-speech tagging, named-entity recognition, and information extraction.
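
As a minimal sketch of the factorization (2.34), the code below evaluates the joint probability of a labeled sentence under a hypothetical two-state HMM. The state set, vocabulary, and all probability values are invented for illustration. It also confirms that the joint sums to one over every state and observation sequence of a fixed length.

```python
from itertools import product

# Hypothetical two-state HMM over a two-word vocabulary.
states = ["PER", "OTHER"]
vocab = ["Bush", "said"]
initial = {"PER": 0.3, "OTHER": 0.7}
transition = {
    "PER": {"PER": 0.4, "OTHER": 0.6},
    "OTHER": {"PER": 0.2, "OTHER": 0.8},
}
emission = {
    "PER": {"Bush": 0.9, "said": 0.1},
    "OTHER": {"Bush": 0.05, "said": 0.95},
}


def joint(y, x):
    """p(y, x) as the product of transition and observation terms in (2.34)."""
    p = 1.0
    prev = None
    for y_t, x_t in zip(y, x):
        # The first position uses the initial distribution, written p(y_1 | y_0).
        p *= initial[y_t] if prev is None else transition[prev][y_t]
        p *= emission[y_t][x_t]
        prev = y_t
    return p


T = 3
total = sum(joint(y, x)
            for y in product(states, repeat=T)
            for x in product(vocab, repeat=T))
example = joint(("PER", "OTHER"), ("Bush", "said"))
```

Here `example` multiplies four terms, 0.3 × 0.9 × 0.6 × 0.95, one initial, one transition, and two observation probabilities.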

2.2.3  Discriminative  and  Generative  Models 

An  important  difference  between  naive  Bayes  and  logistic  regression  is  that  naive 
Bayes  is  generative,  meaning  that  it  is  based  on  a  model  of  the  joint  distribution 
p(y,x),  while  logistic  regression  is  discriminative,  meaning  that  it  is  based  on  a  model 
of  the  conditional  distribution  p(y |x).  In  this  section,  we  discuss  the  differences 
between generative and discriminative modeling, and the potential advantages of
discriminative modeling. For concreteness, we focus on the examples of naive Bayes
and  logistic  regression,  but  the  discussion  in  this  section  actually  applies  in  general 
to  the  differences  between  generative  models  and  conditional  random  fields. 

The  main  difference  is  that  a  conditional  distribution  p(y|x)  does  not  include 
a  model  of  p(x),  which  is  not  needed  for  classification  anyway.  The  difficulty  in 
modeling  p(x)  is  that  it  often  contains  many  highly  dependent  features  that  are 
difficult  to  model.  For  example,  in  named-entity  recognition,  an  HMM  relies  on 
only  one  feature,  the  word’s  identity.  But  many  words,  especially  proper  names,  will 
not  have  occurred  in  the  training  set,  so  the  word-identity  feature  is  uninformative. 
To  label  unseen  words,  we  would  like  to  exploit  other  features  of  a  word,  such  as 
its  capitalization,  its  neighboring  words,  its  prefixes  and  suffixes,  its  membership  in 
predetermined  lists  of  people  and  locations,  and  so  on. 

The  principal  advantage  of  discriminative  modeling  is  that  it  is  better  suited  to 
including  rich,  overlapping  features.  To  understand  this,  consider  the  family  of  naive 
Bayes  distributions  (2.31).  This  is  a  family  of  joint  distributions  whose  conditionals  all 
take  the  “logistic  regression  form”  (2.33).  But  there  are  many  other  joint  models,  some 
with  complex  dependencies  among  x,  whose  conditional  distributions  also  have  the 
form  (2.33).  By  modeling  the  conditional  distribution  directly,  we  can  remain  agnostic 
about  the  form  of  p(x).  This  may  explain  why  it  has  been  observed  that  conditional 
random  fields  tend  to  be  more  robust  than  generative  models  to  violations  of  their 
independence assumptions [58]. Simply put, CRFs make independence assumptions
among  y,  but  not  among  x. 

To  include  interdependent  features  in  a  generative  model,  we  have  two  choices: 
enhance  the  model  to  represent  dependencies  among  the  inputs,  or  make  simplifying 
independence  assumptions,  such  as  the  naive  Bayes  assumption.  The  first  approach, 
enhancing  the  model,  is  often  difficult  to  do  while  retaining  tractability.  For  example, 
it is hard to imagine how to model the dependence between the capitalization of a
word and its suffixes, nor do we particularly wish to do so, since we always observe the
test sentences anyway. The second approach, to include a large number of dependent
features in a generative model, but to include independence assumptions among
them, is possible, and in some domains can work well. But it can also be problematic
because  the  independence  assumptions  can  hurt  performance.  For  example,  although 
the  naive  Bayes  classifier  performs  well  in  document  classification,  it  performs  worse 
on  average  across  a  range  of  applications  than  logistic  regression  [17]. 

Furthermore, even when naive Bayes has good classification accuracy, its probability
estimates tend to be poor. To understand why, imagine training naive Bayes on a
data set in which all the features are repeated, that is, $\mathbf{x} = (x_1, x_1, x_2, x_2, \ldots, x_K, x_K)$.
This will increase the confidence of the naive Bayes probability estimates, even though
no new information has been added to the data. Assumptions like naive Bayes can
be especially problematic when we generalize to sequence models, because inference
essentially combines evidence from different parts of the model. If probability estimates
of the label at each sequence position are overconfident, it might be difficult
to combine them sensibly.
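
The repeated-feature effect can be checked numerically. The following is a minimal sketch, assuming a two-class naive Bayes model with binary features; the class names, the prior, and the likelihood values are hypothetical and chosen only for illustration. Copying each likelihood parameter mirrors what maximum likelihood training would produce on a data set in which every feature is repeated.

```python
import math

def nb_posterior(x, prior, likelihood):
    """Posterior p(y | x) under naive Bayes with binary features.

    prior[y] is p(y); likelihood[y][k] is p(x_k = 1 | y).
    """
    log_scores = {}
    for y in prior:
        log_score = math.log(prior[y])
        for k, x_k in enumerate(x):
            p_on = likelihood[y][k]
            log_score += math.log(p_on if x_k == 1 else 1.0 - p_on)
        log_scores[y] = log_score
    log_norm = math.log(sum(math.exp(s) for s in log_scores.values()))
    return {y: math.exp(s - log_norm) for y, s in log_scores.items()}

# Hypothetical model: two classes, two binary features.
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {"pos": [0.8, 0.7], "neg": [0.4, 0.5]}
x = [1, 1]

# Repeat every feature, x = (x_1, x_1, x_2, x_2), with matching parameters.
x_dup = [v for v in x for _ in range(2)]
likelihood_dup = {y: [p for p in ps for _ in range(2)]
                  for y, ps in likelihood.items()}

p_single = nb_posterior(x, prior, likelihood)["pos"]
p_dup = nb_posterior(x_dup, prior, likelihood_dup)["pos"]
print(p_single, p_dup)
```

With these illustrative numbers the posterior for the first class rises from about 0.74 to about 0.89, although the duplicated features carry no new information.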

Actually, the difference between naive Bayes and logistic regression is due only to
the fact that the first is generative and the second discriminative; the two classifiers
are, for discrete input, identical in all other respects. Naive Bayes and logistic regression
consider the same hypothesis space, in the sense that any logistic regression
classifier can be converted into a naive Bayes classifier with the same decision boundary,
and vice versa. Another way of saying this is that the naive Bayes model (2.31)
defines the same family of distributions as the logistic regression model (2.33), if we
interpret it generatively as

$$p(y, \mathbf{x}) = \frac{\exp\left\{\sum_k \lambda_k f_k(y, \mathbf{x})\right\}}{\sum_{\tilde{y}, \tilde{\mathbf{x}}} \exp\left\{\sum_k \lambda_k f_k(\tilde{y}, \tilde{\mathbf{x}})\right\}}. \tag{2.35}$$

Figure 2.3. Diagram of the relationship between naive Bayes, logistic regression,
HMMs, linear-chain CRFs, generative models, and general CRFs.

This  means  that  if  the  naive  Bayes  model  (2.31)  is  trained  to  maximize  the  conditional 
likelihood,  we  recover  the  same  classifier  as  from  logistic  regression.  Conversely,  if 
the  logistic  regression  model  is  interpreted  generatively,  as  in  (2.35),  and  is  trained  to 
maximize  the  joint  likelihood  p(y,  x),  then  we  recover  the  same  classifier  as  from  naive 
Bayes. In the terminology of Ng and Jordan [88], naive Bayes and logistic regression
form  a  generative-discriminative  pair. 

One  perspective  for  gaining  insight  into  the  difference  between  generative  and 
discriminative  modeling  is  due  to  Minka  [80].  Suppose  we  have  a  generative  model 
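$p_g$ with parameters $\theta$. By definition, this takes the form

$$p_g(\mathbf{y}, \mathbf{x}; \theta) = p_g(\mathbf{y}; \theta)\, p_g(\mathbf{x} \mid \mathbf{y}; \theta). \tag{2.36}$$

But we could also rewrite $p_g$ using Bayes rule as

$$p_g(\mathbf{y}, \mathbf{x}; \theta) = p_g(\mathbf{x}; \theta)\, p_g(\mathbf{y} \mid \mathbf{x}; \theta), \tag{2.37}$$

where $p_g(\mathbf{x}; \theta)$ and $p_g(\mathbf{y} \mid \mathbf{x}; \theta)$ are computed by inference, i.e., $p_g(\mathbf{x}; \theta) = \sum_{\mathbf{y}} p_g(\mathbf{y}, \mathbf{x}; \theta)$
and $p_g(\mathbf{y} \mid \mathbf{x}; \theta) = p_g(\mathbf{y}, \mathbf{x}; \theta) / p_g(\mathbf{x}; \theta)$.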
Now, compare this generative model to a discriminative model over the same
family of joint distributions. To do this, we define a prior $p(\mathbf{x})$ over inputs, such
that $p(\mathbf{x})$ could have arisen from $p_g$ with some parameter setting. That is,
$p(\mathbf{x}) = p_c(\mathbf{x}; \theta') = \sum_{\mathbf{y}} p_g(\mathbf{y}, \mathbf{x}; \theta')$. We combine this with a conditional distribution $p_c(\mathbf{y} \mid \mathbf{x}; \theta)$
that could also have arisen from $p_g$, that is, $p_c(\mathbf{y} \mid \mathbf{x}; \theta) = p_g(\mathbf{y}, \mathbf{x}; \theta) / p_g(\mathbf{x}; \theta)$. Then
the resulting distribution is

$$p_c(\mathbf{y}, \mathbf{x}) = p_c(\mathbf{x}; \theta')\, p_c(\mathbf{y} \mid \mathbf{x}; \theta). \tag{2.38}$$

By comparing (2.37) with (2.38), it can be seen that the conditional approach has
more freedom to fit the data, because it does not require that $\theta = \theta'$. Intuitively,
because the parameters $\theta$ in (2.37) are used in both the input distribution and the
conditional,  a  good  set  of  parameters  must  represent  both  well,  potentially  at  the 
cost  of  trading  off  accuracy  on  p(y|x),  the  distribution  we  care  about,  for  accuracy 
on  p(x),  which  we  care  less  about.  On  the  other  hand,  this  added  freedom  brings 
about  an  increased  risk  of  overfitting  the  training  data,  and  generalizing  worse  on 
unseen  data. 

So far I have tried to provide intuition on why discriminative models can have better
accuracy  than  generative  models.  To  be  fair,  however,  generative  models  have  several 
advantages  of  their  own.  First,  generative  models  tend  to  be  able  to  incorporate 
partially-labeled  or  semi-supervised  data  more  naturally,  although  there  has  been 
work  on  incorporating  both  in  discriminative  models.  In  the  most  extreme  case,  when 
the  data  is  entirely  unlabeled,  generative  models  can  be  applied  in  an  unsupervised 
fashion,  which  can  sometimes  yield  insight  into  the  data,  whereas  a  discriminative 
model  is  not  useful  in  this  case.  Second,  on  some  data  a  generative  model  can  perform 
better  than  a  discriminative  model,  intuitively  because  the  input  model  p(x)  may  have 
a  smoothing  effect  on  the  conditional.  Ng  and  Jordan  [88]  argue  that  this  effect  is 
especially  pronounced  when  the  data  set  is  small.  For  any  particular  data  set,  it  is 
impossible to predict in advance whether a generative or a discriminative model will
perform  better.  Finally,  sometimes  either  the  problem  suggests  a  natural  generative 
model,  or  the  application  requires  the  ability  to  predict  future  inputs  and  outputs, 
making  a  generative  model  preferable. 

It is often natural to represent generative models by directed graphs in which the
outputs y topologically precede the inputs. Thus, the generative model describes how
the  outputs  probabilistically  “generate”  the  inputs.  An  example  of  this  is  the  hidden 
Markov  model,  discussed  in  Section  2.2.2.  Similarly,  as  we  will  see,  discriminative 
models  are  often  naturally  described  by  undirected  graphs.  This  correspondence  need 
not  always  hold,  however.  The  key  distinction  is  whether  the  parameters  that  control 
p(x)  and  p(y|x)  are  forced  to  be  identical,  as  in  a  directed  model,  or  are  modeled  as 
independent,  as  in  a  discriminative  model.  Indeed,  hybrids  of  these  two  regimes  are 
also  possible  [59]. 

In this section, we have discussed the relationship between naive Bayes and logistic
regression in detail because it mirrors the relationship between HMMs and
linear-chain CRFs. Just as naive Bayes and logistic regression are a generative-discriminative
pair, there is a discriminative analog to hidden Markov models, and
this analog is a particular type of conditional random field, as we explain next. The
analogy between naive Bayes, logistic regression, generative models, and conditional
random fields is depicted in Figure 2.3.

2.3  Linear-Chain  Conditional  Random  Fields 

In  the  previous  section,  we  have  seen  advantages  both  to  discriminative  modeling 
and  to  sequence  modeling.  So  it  makes  sense  to  combine  the  two.  This  yields  a  linear- 
chain  CRF,  which  we  describe  in  this  section.  First,  in  Section  2.3.1,  we  define  linear- 
chain  CRFs,  motivating  them  from  HMMs.  Then,  we  discuss  parameter  estimation 
(Section  2.3.2)  and  inference  (Section  2.3.3)  in  linear-chain  CRFs. 
Figure 2.4. Graphical model of an HMM-like linear-chain CRF.


Figure  2.5.  Graphical  model  of  a  linear-chain  CRF  in  which  the  transition  score 
depends  on  the  current  observation. 

2.3.1  From  HMMs  to  CRFs 

To  motivate  our  introduction  of  linear-chain  conditional  random  fields,  we  begin  by 
considering  the  conditional  distribution  p(y|x)  that  follows  from  the  joint  distribution 
p( y,  x)  of  an  HMM.  The  key  point  is  that  this  conditional  distribution  is  in  fact  a 
conditional  random  field  with  a  particular  choice  of  feature  functions. 

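First, we rewrite the HMM joint (2.34) in a form that is more amenable to generalization.
This is

$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left\{ \sum_t \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}} + \sum_t \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}} \right\}, \tag{2.39}$$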

where $\theta = \{\lambda_{ij}, \mu_{oi}\}$ are the parameters of the distribution, and can be any real
numbers. Every HMM can be written in this form, as can be seen simply by setting
$\lambda_{ij} = \log p(y' = i \mid y = j)$ and so on. Because we do not require the parameters to
be log probabilities, we are no longer guaranteed that the distribution sums to 1,
unless we explicitly enforce this by using a normalization constant $Z$. Despite this
added flexibility, it can be shown that (2.39) describes exactly the class of HMMs in
(2.34); we have added flexibility to the parameterization, but we have not added any
distributions to the family.
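
The correspondence between an HMM and its log-linear form can be verified on a toy example. The sketch below assumes a hypothetical two-state, two-symbol HMM with a fixed initial state; all names and probability values are illustrative. It sets each weight to the corresponding log probability and checks that the exponentiated score equals the HMM joint, in which case the normalizer is 1.

```python
import math
from itertools import product

STATES = ["A", "B"]
SYMBOLS = ["u", "v"]
Y0 = "A"  # fixed initial state y_0 (an assumption of this sketch)

# Hypothetical HMM parameters:
# TRANS[j][i] = p(y_t = i | y_{t-1} = j), EMIT[i][o] = p(x_t = o | y_t = i).
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.2, "v": 0.8}}

def hmm_joint(y, x):
    """p(y, x) as a product of transition and emission probabilities."""
    prob, prev = 1.0, Y0
    for y_t, x_t in zip(y, x):
        prob *= TRANS[prev][y_t] * EMIT[y_t][x_t]
        prev = y_t
    return prob

# Log-linear weights: lambda_ij = log p(y_t = i | y_{t-1} = j),
# mu_oi = log p(x_t = o | y_t = i).
LAM = {(i, j): math.log(TRANS[j][i]) for i in STATES for j in STATES}
MU = {(o, i): math.log(EMIT[i][o]) for o in SYMBOLS for i in STATES}

def loglinear_exponent(y, x):
    """Sum of the weights whose indicator functions fire along the sequence."""
    total, prev = 0.0, Y0
    for y_t, x_t in zip(y, x):
        total += LAM[(y_t, prev)] + MU[(x_t, y_t)]
        prev = y_t
    return total

def normalizer(length):
    """Z: sum of exp(score) over all state and observation sequences."""
    return sum(math.exp(loglinear_exponent(y, x))
               for y in product(STATES, repeat=length)
               for x in product(SYMBOLS, repeat=length))

x = ("u", "v", "v")
for y in product(STATES, repeat=len(x)):
    print(y, hmm_joint(y, x), math.exp(loglinear_exponent(y, x)))
```

If the weights are changed to arbitrary real numbers, `normalizer` no longer returns 1, which is why the constant $Z$ must be computed explicitly in that case.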

We can write (2.39) more compactly by introducing the concept of feature functions,
just as we did for logistic regression in (2.33). Each feature function has the form
$f_k(y_t, y_{t-1}, x_t)$. In order to duplicate (2.39), there needs to be one feature
$f_{ij}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{y' = j\}}$ for each transition $(i, j)$ and one feature
$f_{io}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{x = o\}}$ for each state-observation pair $(i, o)$. Then we can write an HMM as:

$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{2.40}$$


Again,  equation  (2.40)  defines  exactly  the  same  family  of  distributions  as  (2.39),  and 
therefore  as  the  original  HMM  equation  (2.34). 

The  last  step  is  to  write  the  conditional  distribution  p(y|x)  that  results  from  the 
HMM  (2.40).  This  is 


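$$p(\mathbf{y} \mid \mathbf{x}) = \frac{p(\mathbf{y}, \mathbf{x})}{\sum_{\mathbf{y}'} p(\mathbf{y}', \mathbf{x})} = \frac{\exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}}{\sum_{\mathbf{y}'} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \right\}}. \tag{2.41}$$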


This  conditional  distribution  (2.41)  is  a  linear-chain  CRF,  in  particular  one  that 
includes  features  only  for  the  current  word’s  identity.  But  many  other  linear-chain 
CRFs  use  richer  features  of  the  input,  such  as  prefixes  and  suffixes  of  the  current  word, 
the  identity  of  surrounding  words,  and  so  on.  Fortunately,  this  extension  requires  little 
change to our existing notation. We simply allow the feature functions $f_k(y_t, y_{t-1}, x_t)$
to  be  more  general  than  indicator  functions.  This  leads  to  the  general  definition  of 
linear-chain  CRFs,  which  we  present  now. 

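Definition 2.2. Let $Y, X$ be random vectors, $\Lambda = \{\lambda_k\} \in \mathbb{R}^K$ be a parameter vector,
and $\{f_k(y, y', \mathbf{x}_t)\}_{k=1}^{K}$ be a set of real-valued feature functions. Then a linear-chain
conditional random field is a distribution $p(\mathbf{y} \mid \mathbf{x})$ that takes the form

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right\}, \tag{2.42}$$

where $Z(\mathbf{x})$ is an instance-specific normalization function

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right\}. \tag{2.43}$$
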


We have just seen that if the joint $p(\mathbf{y}, \mathbf{x})$ factorizes as an HMM, then the associated
conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ is a linear-chain CRF. This HMM-like CRF is
pictured in Figure 2.4. Other types of linear-chain CRFs are also useful, however.
For example, in an HMM, a transition from state $i$ to state $j$ receives the same score,
$\log p(y_t = j \mid y_{t-1} = i)$, regardless of the input. In a CRF, we can allow the score of
the transition $(i, j)$ to depend on the current observation vector, simply by adding a
feature $\mathbf{1}_{\{y_t = j\}} \mathbf{1}_{\{y_{t-1} = i\}} \mathbf{1}_{\{x_t = o\}}$. A CRF with this kind of transition feature, which is
commonly used in text applications, is pictured in Figure 2.5.

To  indicate  in  the  definition  of  linear-chain  CRF  that  each  feature  function  can 
depend  on  observations  from  any  time  step,  we  have  written  the  observation  argument 
to $f_k$ as a vector $\mathbf{x}_t$, which should be understood as containing all the components
of the global observations $\mathbf{x}$ that are needed for computing features at time $t$. For
example, if the CRF uses the next word $x_{t+1}$ as a feature, then the feature vector $\mathbf{x}_t$
is assumed to include the identity of word $x_{t+1}$.

Finally,  note  that  the  normalization  constant  Z(x)  sums  over  all  possible  state 
sequences,  an  exponentially  large  number  of  terms.  Nevertheless,  it  can  be  computed 
efficiently  by  forward-backward,  as  we  explain  in  Section  2.3.3. 
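
To make the definition concrete, the following is a minimal sketch of a linear-chain CRF evaluated by brute-force enumeration. The label set, the start label, the three feature functions, and the weights are all hypothetical and chosen only for illustration; the weights are not learned. Enumerating every label sequence is feasible only for very short inputs.

```python
import math
from itertools import product

LABELS = ["O", "NAME"]
START = "<s>"  # assumed fixed label y_0 preceding the first position

# Hypothetical feature functions f_k(y_t, y_{t-1}, x_t); here x_t is one word.
def f_capitalized_name(y, y_prev, x_t):
    return 1.0 if y == "NAME" and x_t[0].isupper() else 0.0

def f_name_follows_name(y, y_prev, x_t):
    return 1.0 if y == "NAME" and y_prev == "NAME" else 0.0

def f_other(y, y_prev, x_t):
    return 1.0 if y == "O" else 0.0

FEATURES = [f_capitalized_name, f_name_follows_name, f_other]
WEIGHTS = [2.0, 0.5, 1.0]  # illustrative lambda_k values

def score(y, x):
    """Sum over positions t and features k of lambda_k * f_k(y_t, y_{t-1}, x_t)."""
    total, y_prev = 0.0, START
    for y_t, x_t in zip(y, x):
        total += sum(w * f(y_t, y_prev, x_t) for w, f in zip(WEIGHTS, FEATURES))
        y_prev = y_t
    return total

def partition(x):
    """Z(x): sum of exp(score) over every label sequence of the same length."""
    return sum(math.exp(score(y, x)) for y in product(LABELS, repeat=len(x)))

def prob(y, x):
    """p(y | x) as defined for a linear-chain CRF."""
    return math.exp(score(y, x)) / partition(x)

x = ["met", "John", "Smith"]
best = max(product(LABELS, repeat=len(x)), key=lambda y: prob(y, x))
print(best, prob(best, x))
```

The second feature depends on the previous label, which is what distinguishes this model from an independent per-position classifier.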

2.3.2  Parameter  Estimation 

In this section we discuss how to estimate the parameters $\theta = \{\lambda_k\}$ of a linear-chain
CRF. We are given iid training data $\mathcal{D} = \{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^{N}$, where each
$\mathbf{x}^{(i)} = \{\mathbf{x}_1^{(i)}, \mathbf{x}_2^{(i)}, \ldots, \mathbf{x}_T^{(i)}\}$ is a sequence of inputs, and each $\mathbf{y}^{(i)} = \{y_1^{(i)}, y_2^{(i)}, \ldots, y_T^{(i)}\}$ is a
sequence of the desired predictions. Thus, we have relaxed the iid assumption within
each sequence, but we still assume that distinct sequences are independent. (In
Section 2.4, we will see how to relax this assumption as well.)

Parameter  estimation  is  typically  performed  by  penalized  maximum  likelihood. 
Because  we  are  modeling  the  conditional  distribution,  the  following  log  likelihood, 
sometimes  called  the  conditional  log  likelihood ,  is  appropriate: 


$$\ell(\theta) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \tag{2.44}$$


One way to understand the conditional likelihood $p(\mathbf{y} \mid \mathbf{x}; \theta)$ is to imagine combining
it with some arbitrary prior $p(\mathbf{x}; \theta')$ to form a joint $p(\mathbf{y}, \mathbf{x})$. Then when we optimize
the joint log likelihood

$$\log p(\mathbf{y}, \mathbf{x}) = \log p(\mathbf{y} \mid \mathbf{x}; \theta) + \log p(\mathbf{x}; \theta'), \tag{2.45}$$

the two terms on the right-hand side are decoupled, that is, the value of $\theta'$ does not
affect the optimization over $\theta$. If we do not need to estimate $p(\mathbf{x})$, then we can simply
drop the second term, which leaves (2.44).

After substituting the CRF model (2.42) into the likelihood (2.44), we get the
following expression:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)}). \tag{2.46}$$

Before  we  discuss  how  to  optimize  this,  we  mention  regularization.  It  is  often  the  case 
that  we  have  a  large  number  of  parameters.  As  a  measure  to  avoid  overfitting,  we 
use  regularization,  which  is  a  penalty  on  weight  vectors  whose  norm  is  too  large.  A 
common choice of penalty is based on the Euclidean norm of $\theta$ and on a regularization
parameter $1/2\sigma^2$ that determines the strength of the penalty. Then the regularized
log likelihood is

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}. \tag{2.47}$$

The parameter $\sigma^2$ is a free parameter which determines how much to penalize large
weights. The notation for the regularizer is intended to suggest that regularization can
also be viewed as performing maximum a posteriori estimation of $\theta$, if $\theta$ is assigned a
Gaussian prior with mean 0 and covariance $\sigma^2 I$. Determining the best regularization
parameter can require a computationally-intensive parameter sweep. Fortunately,
often the accuracy of the final model is not sensitive to changes in $\sigma^2$, even when
$\sigma^2$ is varied up to a factor of 10. An alternative choice of regularization is to use
the $L_1$ norm instead of the Euclidean norm, which corresponds to an exponential
prior on parameters [43]. This regularizer tends to encourage sparsity in the learned
parameters. Many other choices of regularization are possible as well.

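In general, the function $\ell(\theta)$ cannot be maximized in closed form, so numerical
optimization is used. The partial derivatives of (2.47) are

$$\frac{\partial \ell}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}_t^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', \mathbf{x}_t^{(i)})\, p(y, y' \mid \mathbf{x}^{(i)}) - \frac{\lambda_k}{\sigma^2}. \tag{2.48}$$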


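The first term is the expected value of $f_k$ under the empirical distribution:

$$\tilde{p}(\mathbf{y}, \mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\{\mathbf{y} = \mathbf{y}^{(i)}\}} \mathbf{1}_{\{\mathbf{x} = \mathbf{x}^{(i)}\}}. \tag{2.49}$$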


The second term, which arises from the derivative of $\log Z(\mathbf{x})$, is the expectation of $f_k$
under the model distribution $p(\mathbf{y} \mid \mathbf{x}; \theta)\, \tilde{p}(\mathbf{x})$. Therefore, at the unregularized maximum
likelihood solution, when the gradient is zero, these two expectations are equal. This
pleasing interpretation is a standard result about maximum likelihood estimation in
exponential families.
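
The form of the gradient, observed feature counts minus expected feature counts, can be checked against finite differences on a tiny model. This is a sketch under stated assumptions: binary labels, three hypothetical indicator features, a fixed start label, no regularization, and expectations computed by enumerating all label sequences rather than by forward-backward. Summing whole-sequence feature counts weighted by $p(\mathbf{y} \mid \mathbf{x})$ gives the same expectation as summing over positions and edge marginals.

```python
import math
from itertools import product

LABELS = [0, 1]
START = 0  # assumed fixed y_0

# Hypothetical indicator features f_k(y_t, y_{t-1}, x_t), with x_t in {0, 1}.
FEATURES = [
    lambda y, y_prev, x_t: float(y == x_t),
    lambda y, y_prev, x_t: float(y == y_prev),
    lambda y, y_prev, x_t: float(y == 1),
]

def feature_counts(y, x):
    """Vector of sum_t f_k(y_t, y_{t-1}, x_t) for each k."""
    counts, y_prev = [0.0] * len(FEATURES), START
    for y_t, x_t in zip(y, x):
        for k, f in enumerate(FEATURES):
            counts[k] += f(y_t, y_prev, x_t)
        y_prev = y_t
    return counts

def score(weights, y, x):
    return sum(w * c for w, c in zip(weights, feature_counts(y, x)))

def log_likelihood(weights, data):
    """Unregularized conditional log likelihood, by enumeration."""
    total = 0.0
    for x, y in data:
        z = sum(math.exp(score(weights, yy, x))
                for yy in product(LABELS, repeat=len(x)))
        total += score(weights, y, x) - math.log(z)
    return total

def gradient(weights, data):
    """Observed feature counts minus expected counts under p(y | x)."""
    grad = [0.0] * len(weights)
    for x, y in data:
        seqs = list(product(LABELS, repeat=len(x)))
        unnorm = [math.exp(score(weights, yy, x)) for yy in seqs]
        z = sum(unnorm)
        for k, c in enumerate(feature_counts(y, x)):
            grad[k] += c
        for yy, u in zip(seqs, unnorm):
            for k, c in enumerate(feature_counts(yy, x)):
                grad[k] -= (u / z) * c
    return grad

def finite_difference(weights, data, eps=1e-6):
    """Central-difference approximation of the gradient."""
    approx = []
    for k in range(len(weights)):
        up, down = list(weights), list(weights)
        up[k] += eps
        down[k] -= eps
        approx.append((log_likelihood(up, data) - log_likelihood(down, data)) / (2 * eps))
    return approx

DATA = [((0, 1, 1), (0, 1, 1)), ((1, 0), (1, 1))]  # (x, y) pairs, illustrative
WEIGHTS = [0.3, -0.2, 0.1]
print(gradient(WEIGHTS, DATA))
print(finite_difference(WEIGHTS, DATA))
```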

Now we discuss how to optimize $\ell(\theta)$. The function $\ell(\theta)$ is concave, which follows
from the convexity of functions of the form $g(\mathbf{x}) = \log \sum_i \exp x_i$. Convexity is extremely
helpful for parameter estimation, because it means that every local optimum
is also a global optimum. Adding regularization ensures that $\ell$ is strictly concave,
which implies that it has exactly one global optimum.

Perhaps the simplest approach to optimize $\ell$ is steepest ascent along the gradient
(2.48), but this requires too many iterations to be practical. Newton’s method converges
much faster because it takes into account the curvature of the likelihood, but it
requires  computing  the  Hessian,  the  matrix  of  all  second  derivatives.  The  size  of  the 
Hessian  is  quadratic  in  the  number  of  parameters.  Since  practical  applications  often 
use  tens  of  thousands  or  even  millions  of  parameters,  even  storing  the  full  Hessian  is 
not  practical. 

Instead,  current  techniques  for  optimizing  (2.47)  make  approximate  use  of  second- 
order  information.  Particularly  successful  have  been  quasi-Newton  methods  such 
as  BFGS  [6],  which  compute  an  approximation  to  the  Hessian  from  only  the  first 
derivative  of  the  objective  function.  A  full  K  x  K  approximation  to  the  Hessian  still 
requires  quadratic  size,  however,  so  a  limited-memory  version  of  BFGS  is  used,  due 
to  Byrd  et  al.  [15].  As  an  alternative  to  limited-memory  BFGS,  conjugate  gradient 
is  another  optimization  technique  that  also  makes  approximate  use  of  second-order 
information  and  has  been  used  successfully  with  CRFs.  Either  can  be  thought  of  as 
a  black-box  optimization  routine  that  is  a  drop-in  replacement  for  vanilla  gradient 
ascent.  When  such  second-order  methods  are  used,  gradient-based  optimization  is 
much  faster  than  the  original  approaches  based  on  iterative  scaling  in  Lafferty  et  al. 
[58],  as  shown  experimentally  by  several  authors  [65,  79,  112,  143].  Finally,  trust 
region methods have recently been shown to perform well on multinomial logistic
regression  [62],  and  may  work  well  for  more  general  CRFs  as  well. 

Recently,  stochastic  gradient  methods,  which  make  updates  based  on  subsets  of 
the  training  instances,  have  been  shown  to  be  highly  effective  [136],  and  may  be  an 
attractive  alternative  to  second-order  methods,  which  tend  to  evaluate  the  gradient 
over all the training instances before making an update. A promising new alternative
to stochastic gradient methods is presented by Globerson et al. [42]. They make
online-style updates in a dual of the original likelihood rather than in the primal representation,
which provides both stronger convergence guarantees and added stability
in practice.

Finally, it is important to remark on the computational cost of training. Both the
partition function $Z(\mathbf{x})$ in the likelihood and the marginal distributions $p(y_t, y_{t-1} \mid \mathbf{x})$
in the gradient can be computed by forward-backward, which has computational
complexity $O(TM^2)$, where $T$ is the sequence length and $M$ is the number of states.
However, each training instance will have a different partition
function and marginals, so we need to run forward-backward for each training instance
for each gradient computation, for a total training cost of $O(TM^2NG)$, where $N$ is the
number of training examples, and $G$ the number of gradient computations required
by  the  optimization  procedure.  (Unfortunately,  the  number  of  iterations  G  depends 
on  the  data  set,  and  is  difficult  to  predict  in  advance.  For  batch  L-BFGS  on  linear- 
chain  CRFs,  it  is  usually  but  not  always  under  100.)  For  many  data  sets,  this  cost  is 
reasonable,  but  if  the  number  of  states  is  large,  or  the  number  of  training  sequences  is 
very  large,  then  this  can  become  expensive.  For  example,  on  a  standard  named-entity 
data  set,  with  11  labels  and  200,000  words  of  training  data,  CRF  training  finishes  in 
under  two  hours  on  current  hardware.  However,  on  a  part-of-speech  tagging  data  set, 
with  45  labels  and  one  million  words  of  training  data,  CRF  training  requires  over  a 
week. 
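
As a rough check on these two figures, the cost model above can be evaluated directly. The sketch below assumes that the number of gradient computations and all constant factors are the same for both data sets, which the text does not state; the function name is hypothetical. Under that assumption, cost scales as the squared number of labels times the total number of training tokens.

```python
def relative_cost(num_labels, num_tokens):
    """Cost of one pass of forward-backward over the data, up to a constant:
    M^2 times the total number of tokens (T summed over all N sequences)."""
    return num_labels ** 2 * num_tokens

ner_cost = relative_cost(num_labels=11, num_tokens=200_000)
pos_cost = relative_cost(num_labels=45, num_tokens=1_000_000)
ratio = pos_cost / ner_cost
print(ratio)
```

The ratio is roughly 84, so a run of about two hours on the named-entity data would scale to about a week on the part-of-speech data, the same order as the training time quoted above.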
2.3.3 Inference


There are two common inference problems for CRFs. First, during training,
computing the gradient requires marginal distributions for each edge $p(y_t, y_{t-1} \mid \mathbf{x})$,
and computing the likelihood requires $Z(\mathbf{x})$. Second, to label an unseen instance,
we compute the most likely labeling $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$. In linear-chain CRFs,
both inference tasks can be performed efficiently and exactly by variants of the standard
dynamic-programming algorithms for HMMs: forward-backward for computing
marginal distributions and the Viterbi algorithm for computing max-marginals. In this
section,  we  briefly  review  the  HMM  algorithms,  and  extend  them  to  linear-chain 
CRFs.  These  standard  inference  algorithms  are  described  in  more  detail  by  Rabiner 
[98].  Both  of  these  algorithms  are  special  cases  of  the  belief  propagation  algorithm 
described  in  Section  2.1.4,  but  I  discuss  this  special  case  in  detail  both  because  it 
may  help  to  make  the  earlier  discussion  more  concrete,  and  because  it  is  very  useful 
in  practice. 

First,  we  introduce  notation  which  will  simplify  the  forward-backward  recursions. 
An HMM can be viewed as a factor graph $p(\mathbf{y}, \mathbf{x}) = \prod_t \Psi_t(y_t, y_{t-1}, x_t)$ where $Z = 1$,
and the factors are defined as:

$$\Psi_t(j, i, x) = p(y_t = j \mid y_{t-1} = i)\, p(x_t = x \mid y_t = j). \tag{2.50}$$

If the HMM is viewed as a weighted finite state machine, then $\Psi_t(j, i, x)$ is the weight
on the transition from state $i$ to state $j$ when the current observation is $x$.

Now, we review the HMM forward algorithm, which is used to compute the probability
$p(\mathbf{x})$ of the observations. The idea behind forward-backward is to first rewrite
the naive summation $p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y})$ using the distributive law:

$$p(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t) \tag{2.51}$$

$$= \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \sum_{y_{T-3}} \cdots \tag{2.52}$$

Now  we  observe  that  each  of  the  intermediate  sums  is  reused  many  times  during  the 
computation  of  the  outer  sum,  and  so  we  can  save  an  exponential  amount  of  work  by 
caching  the  inner  sums. 

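This leads to defining a set of forward variables $\alpha_t$, each of which is a vector of
size $M$ (where $M$ is the number of states) which stores one of the intermediate sums.
These are defined as:

$$\alpha_t(j) = p(\mathbf{x}_{\langle 1 \ldots t \rangle}, y_t = j) \tag{2.53}$$

$$= \sum_{\mathbf{y}_{\langle 1 \ldots t-1 \rangle}} \Psi_t(j, y_{t-1}, x_t) \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}), \tag{2.54}$$

where the summation over $\mathbf{y}_{\langle 1 \ldots t-1 \rangle}$ ranges over all assignments to the sequence of
random variables $y_1, y_2, \ldots, y_{t-1}$. The alpha values can be computed by the recursion

$$\alpha_t(j) = \sum_{i \in S} \Psi_t(j, i, x_t)\, \alpha_{t-1}(i), \tag{2.55}$$

with initialization $\alpha_1(j) = \Psi_1(j, y_0, x_1)$. (Recall that $y_0$ is the fixed initial state of
the HMM.) It is easy to see that $p(\mathbf{x}) = \sum_{y_T} \alpha_T(y_T)$ by repeatedly substituting the
recursion (2.55) to obtain (2.52). A formal proof would use induction.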
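
The forward recursion can be sketched in a few lines and compared with the naive summation over all state sequences. The two-state HMM below, its parameter values, and the observation sequence are hypothetical; the fixed initial state follows the convention in the text.

```python
from itertools import product

STATES = ["H", "C"]
Y0 = "H"  # fixed initial state y_0

# Hypothetical parameters: TRANS[i][j] = p(y_t = j | y_{t-1} = i),
# EMIT[j][x] = p(x_t = x | y_t = j).
TRANS = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def psi(j, i, x_t):
    """Weight of the transition from state i to state j while observing x_t."""
    return TRANS[i][j] * EMIT[j][x_t]

def forward(x):
    """Return the list of forward vectors alpha_1, ..., alpha_T as dicts."""
    alpha = [{j: psi(j, Y0, x[0]) for j in STATES}]
    for x_t in x[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(psi(j, i, x_t) * prev[i] for i in STATES)
                      for j in STATES})
    return alpha

def brute_force_px(x):
    """p(x) by summing the joint over every state sequence (exponential cost)."""
    total = 0.0
    for y in product(STATES, repeat=len(x)):
        p, prev = 1.0, Y0
        for y_t, x_t in zip(y, x):
            p *= psi(y_t, prev, x_t)
            prev = y_t
        total += p
    return total

x = [3, 1, 3, 2]
p_x = sum(forward(x)[-1].values())
print(p_x, brute_force_px(x))
```

The recursion touches each pair of states once per position, whereas the brute-force sum grows exponentially with the sequence length.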

The backward recursion is exactly the same, except that in (2.52), we push in the
summations in reverse order. This results in the definition

$$\beta_t(i) = p(\mathbf{x}_{\langle t+1 \ldots T \rangle} \mid y_t = i) \tag{2.56}$$

$$= \sum_{\mathbf{y}_{\langle t+1 \ldots T \rangle}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}), \tag{2.57}$$

and the recursion

$$\beta_t(i) = \sum_{j \in S} \Psi_{t+1}(j, i, x_{t+1})\, \beta_{t+1}(j), \tag{2.58}$$

which is initialized $\beta_T(i) = 1$. Analogously to the forward case, we can compute $p(\mathbf{x})$
using the backward variables as $p(\mathbf{x}) = \beta_0(y_0) = \sum_{y_1} \Psi_1(y_1, y_0, x_1)\, \beta_1(y_1)$.

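By combining results from the forward and backward recursions, we can compute
the marginal distributions $p(y_{t-1}, y_t \mid \mathbf{x})$ needed for the gradient (2.48). This can be
seen from either the probabilistic or the factorization perspectives. First, taking a
probabilistic viewpoint we can write

$$p(y_{t-1}, y_t \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y_{t-1}, y_t)\, p(y_t, y_{t-1})}{p(\mathbf{x})} \tag{2.59}$$

$$= \frac{p(\mathbf{x}_{\langle 1 \ldots t-1 \rangle}, y_{t-1})\, p(y_t \mid y_{t-1})\, p(x_t \mid y_t)\, p(\mathbf{x}_{\langle t+1 \ldots T \rangle} \mid y_t)}{p(\mathbf{x})} \tag{2.60}$$

$$\propto \alpha_{t-1}(y_{t-1})\, \Psi_t(y_t, y_{t-1}, x_t)\, \beta_t(y_t), \tag{2.61}$$

where in the second line we have used the fact that $\mathbf{x}_{\langle 1 \ldots t-1 \rangle}$ is independent from
$\mathbf{x}_{\langle t+1 \ldots T \rangle}$ and from $x_t$ given $y_{t-1}, y_t$. Second, from the factorization perspective, we
can apply the distributive law to see that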


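$$p(y_{t-1}, y_t, \mathbf{x}) = \Psi_t(y_t, y_{t-1}, x_t) \left( \sum_{\mathbf{y}_{\langle 1 \ldots t-2 \rangle}} \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right) \left( \sum_{\mathbf{y}_{\langle t+1 \ldots T \rangle}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right), \tag{2.62}$$

which can be computed from the forward and backward recursions as

$$p(y_{t-1}, y_t, \mathbf{x}) = \alpha_{t-1}(y_{t-1})\, \Psi_t(y_t, y_{t-1}, x_t)\, \beta_t(y_t). \tag{2.63}$$

Once we have $p(y_{t-1}, y_t, \mathbf{x})$, we can renormalize over $y_t, y_{t-1}$ to obtain the desired
marginal $p(y_{t-1}, y_t \mid \mathbf{x})$.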
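
The combination of forward and backward variables into edge marginals can be sketched as follows and checked against enumeration. The two-state HMM, its parameter values, and the observation sequence are hypothetical, and positions are indexed from 0 in the code, so `edge_marginal(x, t)` requires `t >= 1`.

```python
from itertools import product

STATES = ["H", "C"]
Y0 = "H"  # fixed initial state y_0
TRANS = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}  # illustrative
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def psi(j, i, x_t):
    """Weight of the transition from state i to state j while observing x_t."""
    return TRANS[i][j] * EMIT[j][x_t]

def forward(x):
    alpha = [{j: psi(j, Y0, x[0]) for j in STATES}]
    for x_t in x[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(psi(j, i, x_t) * prev[i] for i in STATES)
                      for j in STATES})
    return alpha

def backward(x):
    """Return backward vectors, one per position; the last one is all ones."""
    beta = [{i: 1.0 for i in STATES}]
    for x_next in reversed(x[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(psi(j, i, x_next) * nxt[j] for j in STATES)
                        for i in STATES})
    return beta

def edge_marginal(x, t):
    """p(y_{t-1} = i, y_t = j | x): alpha * psi * beta, then renormalized."""
    alpha, beta = forward(x), backward(x)
    joint = {(i, j): alpha[t - 1][i] * psi(j, i, x[t]) * beta[t][j]
             for i in STATES for j in STATES}
    total = sum(joint.values())
    return {pair: value / total for pair, value in joint.items()}

def edge_marginal_brute_force(x, t):
    """Same marginal by summing the joint over all state sequences."""
    joint = {(i, j): 0.0 for i in STATES for j in STATES}
    for y in product(STATES, repeat=len(x)):
        p, prev = 1.0, Y0
        for y_t, x_t in zip(y, x):
            p *= psi(y_t, prev, x_t)
            prev = y_t
        joint[(y[t - 1], y[t])] += p
    total = sum(joint.values())
    return {pair: value / total for pair, value in joint.items()}

X = [3, 1, 3, 2]
print(edge_marginal(X, 2))
```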

Finally, to compute the globally most probable assignment $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$,
we observe that the trick in (2.52) still works if all the summations are replaced by
maximization. This yields the Viterbi recursion:

$$\delta_t(j) = \max_{i \in S} \Psi_t(j, i, x_t)\, \delta_{t-1}(i). \tag{2.64}$$
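
A minimal sketch of the Viterbi recursion follows. The recursion itself yields only the maximal score; the backpointers used here to recover the maximizing sequence are a standard addition that the text does not spell out. The HMM parameters and the observation sequence are hypothetical.

```python
from itertools import product

STATES = ["H", "C"]
Y0 = "H"  # fixed initial state y_0
TRANS = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}  # illustrative
EMIT = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def psi(j, i, x_t):
    """Weight of the transition from state i to state j while observing x_t."""
    return TRANS[i][j] * EMIT[j][x_t]

def joint(y, x):
    """p(y, x) as a product of factors."""
    p, prev = 1.0, Y0
    for y_t, x_t in zip(y, x):
        p *= psi(y_t, prev, x_t)
        prev = y_t
    return p

def viterbi(x):
    """Return (best state sequence, its joint probability with x)."""
    delta = {j: psi(j, Y0, x[0]) for j in STATES}
    backpointers = []
    for x_t in x[1:]:
        new_delta, pointers = {}, {}
        for j in STATES:
            best_i = max(STATES, key=lambda i: psi(j, i, x_t) * delta[i])
            new_delta[j] = psi(j, best_i, x_t) * delta[best_i]
            pointers[j] = best_i
        delta = new_delta
        backpointers.append(pointers)
    last = max(STATES, key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

X = [3, 1, 3, 2]
best_path, best_prob = viterbi(X)
print(best_path, best_prob)
```

Because $p(\mathbf{x})$ does not depend on $\mathbf{y}$, maximizing the joint $p(\mathbf{y}, \mathbf{x})$ gives the same sequence as maximizing $p(\mathbf{y} \mid \mathbf{x})$.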


Now that we have described the forward-backward and Viterbi algorithms for
HMMs, the generalization to linear-chain CRFs is fairly straightforward. The forward-backward
algorithm for linear-chain CRFs is identical to the HMM version, except
that the transition weights $\Psi_t(j, i, x_t)$ are defined differently. We observe that the
CRF model (2.42) can be rewritten as:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}_t), \tag{2.65}$$

where we define

$$\Psi_t(y_t, y_{t-1}, \mathbf{x}_t) = \exp\left\{ \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}_t) \right\}. \tag{2.66}$$



With  that  definition,  the  forward  recursion  (2.55),  the  backward  recursion  (2.58), 
and  the  Viterbi  recursion  (2.64)  can  be  used  unchanged  for  linear-chain  CRFs.  Instead 
of  computing  p(x)  as  in  an  HMM,  in  a  CRF  the  forward  and  backward  recursions 
compute  Z(x). 
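
The following sketch applies the same forward recursion to CRF factors and checks that it returns $Z(\mathbf{x})$. The label set, start label, feature functions, and weights are hypothetical and not learned. For long sequences a practical implementation would usually work with logarithms to avoid overflow, which this sketch omits.

```python
import math
from itertools import product

LABELS = ["O", "NAME"]
START = "<s>"  # assumed fixed label y_0 preceding the first position

# Hypothetical feature functions f_k(y_t, y_{t-1}, x_t) and weights lambda_k.
FEATURES = [
    lambda y, y_prev, x_t: 1.0 if y == "NAME" and x_t[0].isupper() else 0.0,
    lambda y, y_prev, x_t: 1.0 if y == "NAME" and y_prev == "NAME" else 0.0,
    lambda y, y_prev, x_t: 1.0 if y == "O" else 0.0,
]
WEIGHTS = [2.0, 0.5, 1.0]

def psi(y, y_prev, x_t):
    """CRF factor: exp of the weighted feature sum at one position."""
    return math.exp(sum(w * f(y, y_prev, x_t) for w, f in zip(WEIGHTS, FEATURES)))

def partition_forward(x):
    """Z(x) by the forward recursion, in O(T M^2) factor evaluations."""
    alpha = {j: psi(j, START, x[0]) for j in LABELS}
    for x_t in x[1:]:
        alpha = {j: sum(psi(j, i, x_t) * alpha[i] for i in LABELS)
                 for j in LABELS}
    return sum(alpha.values())

def partition_brute_force(x):
    """Z(x) by summing the product of factors over every label sequence."""
    total = 0.0
    for y in product(LABELS, repeat=len(x)):
        weight, y_prev = 1.0, START
        for y_t, x_t in zip(y, x):
            weight *= psi(y_t, y_prev, x_t)
            y_prev = y_t
        total += weight
    return total

x = ["met", "John", "Smith", "yesterday"]
print(partition_forward(x), partition_brute_force(x))
```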

A  final  inference  task  that  is  useful  in  some  applications  is  to  compute  a  marginal 
probability $p(y_t, y_{t+1}, \ldots, y_{t+k} \mid \mathbf{x})$ over a possibly non-contiguous range of nodes. For
example,  this  is  useful  for  measuring  the  model’s  confidence  in  its  predicted  labeling 
over  a  segment  of  input.  This  marginal  probability  can  be  computed  efficiently  using 
constrained  forward-backward,  as  described  by  [26]. 
2.4 CRFs in General


In  this  section,  we  define  CRFs  with  general  graphical  structure,  as  they  were 
introduced  originally  [58].  Although  initial  applications  of  CRFs  used  linear  chains, 
there have been many later applications of CRFs with more general graphical structures.
Such structures are especially useful for relational learning, because they allow
relaxing the iid assumption among entities, or for more complicated entities, such as
grids and trees. Also, although CRFs have typically been used for across-network
classification, in which the training and testing data are assumed to be independent,
we will see that CRFs can be used for within-network classification as well, in which
we  model  probabilistic  dependencies  between  the  training  and  testing  data. 

The generalization from linear-chain CRFs to general CRFs is fairly straightforward.
We simply move from using a linear-chain factor graph to a more general factor
graph,  and  from  forward-backward  to  more  general  (perhaps  approximate)  inference 
algorithms. 


2.4.1  Model 

First  we  present  the  general  definition  of  a  conditional  random  field. 

Definition 2.3. Let $G$ be a factor graph over $Y$. Then $p(\mathbf{y} \mid \mathbf{x})$ is a conditional random
field if for any fixed $\mathbf{x}$, the distribution $p(\mathbf{y} \mid \mathbf{x})$ factorizes according to $G$.

Thus, every conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ is a CRF for some, perhaps trivial,
factor graph. If $F = \{\Psi_A\}$ is the set of factors in $G$, and each factor takes the
exponential family form (2.3), then the conditional distribution can be written as

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{\Psi_A \in G} \exp\left\{ \sum_{k=1}^{K(A)} \lambda_{Ak} f_{Ak}(\mathbf{y}_A, \mathbf{x}_A) \right\}. \tag{2.67}$$



In addition, practical models rely extensively on parameter tying. For example, in
the linear-chain case, often the same weights are used for the factors $\Psi_t(y_t, y_{t-1}, \mathbf{x}_t)$ at
each time step. To denote this, we partition the factors of $G$ into $\mathcal{C} = \{C_1, C_2, \ldots, C_P\}$,
where each $C_p$ is a clique template whose parameters are tied. This notion of clique
template generalizes that in Taskar et al. [128], Sutton et al. [125], and Richardson and
Domingos [101]. Each clique template $C_p$ is a set of factors which has a corresponding
set of sufficient statistics $\{f_{pk}(\mathbf{x}_p, \mathbf{y}_p)\}$ and parameters $\theta_p \in \mathbb{R}^{K(p)}$. Then the CRF
can be written as

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p), \tag{2.68}$$


where  each  factor  is  parameterized  as 


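$$\Psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p) = \exp\left\{ \sum_{k=1}^{K(p)} \lambda_{pk} f_{pk}(\mathbf{x}_c, \mathbf{y}_c) \right\}, \tag{2.69}$$

and the normalization function is

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p). \tag{2.70}$$

For example, in a linear-chain conditional random field, typically one clique template
$\mathcal{C} = \{\Psi_t(y_t, y_{t-1}, \mathbf{x}_t)\}_{t=1}^{T}$ is used for the entire network.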

Several  special  cases  of  conditional  random  fields  are  of  particular  interest.  First, 
dynamic  conditional  random  fields  [125]  are  sequence  models  which  allow  multiple 
labels  at  each  time  step,  rather  than  single  labels  as  in  linear-chain  CRFs.  Second, 
relational  Markov  networks  [128]  are  a  type  of  general  CRF  in  which  the  graphical 
structure  and  parameter  tying  are  determined  by  an  SQL-like  syntax.  Finally,  Markov 
logic  networks  [101,  113]  are  a  type  of  probabilistic  logic  in  which  there  are  parameters 
for  each  first-order  rule  in  a  knowledge  base. 
2.4.2 Applications of CRFs

CRFs have been applied to a variety of domains, including text processing, computer
vision, and bioinformatics. In this section, we discuss several applications,
highlighting  the  different  graphical  structures  that  occur  in  the  literature. 

One  of  the  first  large-scale  applications  of  CRFs  was  by  Sha  and  Pereira  [112], 
who  matched  state-of-the-art  performance  on  segmenting  noun  phrases  in  text.  Since 
then,  linear-chain  CRFs  have  been  applied  to  many  problems  in  natural  language 
processing,  including  named-entity  recognition  [70],  feature  induction  for  NER  [68], 
identifying  protein  names  in  biology  abstracts  [111],  segmenting  addresses  in  Web 
pages  [27],  finding  semantic  roles  in  text  [106],  identifying  the  sources  of  opinions 
[19],  Chinese  word  segmentation  [92],  Japanese  morphological  analysis  [56],  and  many 
others. 

In bioinformatics, CRFs have been applied to RNA structural alignment [109]
and protein structure prediction [64]. Semi-Markov CRFs [108] add somewhat more
flexibility  in  choosing  features,  which  may  be  useful  for  certain  tasks  in  information 
extraction  and  especially  bioinformatics. 

General CRFs have also been applied to several tasks in NLP. One promising
application is performing multiple labeling tasks simultaneously. For example,
Sutton et al. [125] show that a two-level dynamic CRF for part-of-speech tagging and
noun-phrase chunking performs better than solving the tasks one at a time. Another
application is multi-label classification, in which each instance can have multiple class
labels.  Rather  than  learning  an  independent  classifier  for  each  category,  Ghamrawi 
and  McCallum  [40]  present  a  CRF  that  learns  dependencies  between  the  categories, 
resulting  in  improved  classification  performance.  Finally,  the  skip-chain  CRF,  which 
we  present  in  Chapter  3,  is  a  general  CRF  that  represents  long-distance  dependencies 
in  information  extraction. 

An interesting graphical CRF structure has been applied to the problem of proper-noun
coreference, that is, of determining which mentions in a document, such as Mr.
President and he, refer to the same underlying entity. McCallum and Wellner [71]
learn a distance metric between mentions using a fully-connected conditional random
field in which inference corresponds to graph partitioning. A similar model has been
used to segment handwritten characters and diagrams [23, 96].

In  some  applications  of  CRFs,  efficient  dynamic  programs  exist  even  though  the 
graphical model is difficult to specify. For example, McCallum et al. [73] learn the parameters of a
string-edit  model  in  order  to  discriminate  between  matching  and  nonmatching  pairs  of 
strings.  Also,  there  is  work  on  using  CRFs  to  learn  distributions  over  the  derivations 
of  a  grammar  [21,  102,  117,  135].  A  potentially  useful  unifying  framework  for  this 
type  of  model  is  provided  by  case-factor  diagrams  [67]. 

In computer vision, several authors have used grid-shaped CRFs [46, 57] for labeling
and segmenting images. Also, for recognizing objects, Quattoni et al. [97] use
a  tree-shaped  CRF  in  which  latent  variables  are  designed  to  recognize  characteristic 
parts  of  an  object. 

2.4.3  Parameter  Estimation 

Parameter estimation for general CRFs is essentially the same as for linear-chain CRFs,
except that computing the model expectations requires more general inference algorithms.
First, we discuss the fully-observed case, in which the training and testing
data are independent, and the training data is fully observed. In this case the conditional
log likelihood is given by

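$$\ell(\theta) = \sum_{C_p \in \mathcal{C}} \sum_{\Psi_c \in C_p} \sum_{k=1}^{K(p)} \lambda_{pk} f_{pk}(\mathbf{x}_c, \mathbf{y}_c) - \log Z(\mathbf{x}). \tag{2.71}$$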

It is worth noting that the equations in this section do not explicitly sum over training
instances,  because  if  a  particular  application  happens  to  have  iid  training  instances, 
they  can  be  represented  by  disconnected  components  in  the  graph  G. 

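The partial derivative of the log likelihood with respect to a parameter $\lambda_{pk}$ associated
with a clique template $C_p$ is

$$\frac{\partial \ell}{\partial \lambda_{pk}} = \sum_{\Psi_c \in C_p} f_{pk}(\mathbf{x}_c, \mathbf{y}_c) - \sum_{\Psi_c \in C_p} \sum_{\mathbf{y}'_c} f_{pk}(\mathbf{x}_c, \mathbf{y}'_c)\, p(\mathbf{y}'_c \mid \mathbf{x}). \tag{2.72}$$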

The function $\ell(\theta)$ has many of the same properties as in the linear-chain case. First,
the zero-gradient conditions can be interpreted as requiring that the sufficient statistics
$F_{pk}(\mathbf{x}, \mathbf{y}) = \sum_{\Psi_c \in C_p} f_{pk}(\mathbf{x}_c, \mathbf{y}_c)$ have the same expectations under the empirical
distribution and under the model distribution. Second, the function $\ell(\theta)$ is concave, and
can be efficiently maximized by second-order techniques such as conjugate gradient
and L-BFGS. Finally, regularization is used just as in the linear-chain case.
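
The moment-matching form of the gradient can be checked numerically. The sketch below assumes a toy linear chain with two hypothetical features and compares the analytic gradient (empirical feature totals minus expected feature totals) against a central finite difference of the log likelihood. For brevity the expectation is taken over whole label sequences rather than over clique marginals; by linearity of expectation the two give the same value.

```python
import itertools
import math

LABELS = [0, 1]  # hypothetical binary label set

def feature_totals(y, x):
    """F_k(x, y): each feature summed over all factors of the template."""
    trans = sum(1.0 for t in range(1, len(x)) if y[t] == y[t - 1])
    agree = sum(1.0 for t in range(1, len(x)) if y[t] == x[t])
    return [trans, agree]

def score(theta, y, x):
    return sum(l * f for l, f in zip(theta, feature_totals(y, x)))

def log_likelihood(theta, y, x):
    """Conditional log likelihood: score of the observed y minus log Z(x)."""
    log_z = math.log(sum(math.exp(score(theta, yy, x))
                         for yy in itertools.product(LABELS, repeat=len(x))))
    return score(theta, y, x) - log_z

def gradient(theta, y, x):
    """Empirical feature totals minus their expectation under p(y' | x)."""
    all_y = list(itertools.product(LABELS, repeat=len(x)))
    weights = [math.exp(score(theta, yy, x)) for yy in all_y]
    z = sum(weights)
    expected = [0.0] * len(theta)
    for yy, w in zip(all_y, weights):
        for k, f in enumerate(feature_totals(yy, x)):
            expected[k] += (w / z) * f
    return [emp - exp_k for emp, exp_k in zip(feature_totals(y, x), expected)]

def numeric_gradient(theta, y, x, eps=1e-6):
    """Central finite-difference approximation, used only as a check."""
    grads = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grads.append((log_likelihood(up, y, x) - log_likelihood(down, y, x)) / (2 * eps))
    return grads

theta = [0.3, -0.7]
x = (0, 1, 1, 0)
y = (0, 1, 1, 1)
analytic = gradient(theta, y, x)
numeric = numeric_gradient(theta, y, x)
print(analytic, numeric)
```

At a maximum of the likelihood, `analytic` is zero, which is exactly the statement that empirical and model expectations of the sufficient statistics agree.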

Now,  we  discuss  the  latent-variable  case,  in  which  the  model  contains  variables 
that  are  observed  at  neither  training  nor  test  time.  It  is  more  difficult  to  train 
CRFs  with  latent  variables,  because  the  latent  variables  need  to  be  marginalized 
out to compute the likelihood. Because of this difficulty, the original work on CRFs
focused on fully-observed training data, but recently there has been increasing interest
in training latent-variable CRFs [73, 97].

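Suppose we have a conditional random field with inputs $\mathbf{x}$ in which the output
variables $\mathbf{y}$ are observed in the training data, but we have additional variables $\mathbf{w}$ that
are latent, so that the CRF has the form

$$p(\mathbf{y}, \mathbf{w} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{w}_c, \mathbf{y}_c; \theta_p). \tag{2.73}$$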

A natural objective function to maximize during training is the marginal likelihood

$$\ell(\theta) = \log p(\mathbf{y} \mid \mathbf{x}) = \log \sum_{\mathbf{w}} p(\mathbf{y}, \mathbf{w} \mid \mathbf{x}). \tag{2.74}$$


The first question is how even to compute the marginal likelihood $\ell(\theta)$, because if
there are many variables $\mathbf{w}$, the sum cannot be computed directly. The key is to
realize that we need to compute $\log \sum_{\mathbf{w}} p(\mathbf{y}, \mathbf{w} \mid \mathbf{x})$ not for any possible assignment $\mathbf{y}$,
but only for the particular assignment that occurs in the training data. This motivates
taking the original CRF (2.73), and clamping the variables $Y$ to their observed values
in the training data, yielding a distribution over $\mathbf{w}$:

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{x}) = \frac{1}{Z(\mathbf{y}, \mathbf{x})} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{w}_c, \mathbf{y}_c; \theta_p), \tag{2.75}$$

where  the  normalization  factor  is 


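$$Z(\mathbf{y}, \mathbf{x}) = \sum_{\mathbf{w}} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{w}_c, \mathbf{y}_c; \theta_p). \tag{2.76}$$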

This new normalization constant $Z(\mathbf{y}, \mathbf{x})$ can be computed by the same inference
algorithm that we use to compute $Z(\mathbf{x})$. In fact, $Z(\mathbf{y}, \mathbf{x})$ is easier to compute, because
it sums only over $\mathbf{w}$, while $Z(\mathbf{x})$ sums over both $\mathbf{w}$ and $\mathbf{y}$. Graphically, this amounts
to saying that clamping the variables $\mathbf{y}$ in the graph $G$ can simplify the structure
among $\mathbf{w}$.

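Once we have $Z(\mathbf{y}, \mathbf{x})$, the marginal likelihood can be computed as

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{w}} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{w}_c, \mathbf{y}_c; \theta_p) = \frac{Z(\mathbf{y}, \mathbf{x})}{Z(\mathbf{x})}. \tag{2.77}$$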
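
As an illustration of (2.76) and (2.77), the sketch below computes the marginal log likelihood of a toy latent-variable chain as the difference of two log normalizers. The model (one binary latent variable and one binary output per position, with three hypothetical indicator features) is an assumption made for this example, and both sums are computed by enumeration rather than by a real inference algorithm.

```python
import itertools
import math

VALUES = [0, 1]  # hypothetical domain shared by latent and output variables

def score(theta, y, w, x):
    """Log of the product of factors in (2.73) for a toy chain with latent w_t."""
    s = 0.0
    for t in range(len(x)):
        s += theta[0] * (w[t] == x[t])          # latent/observation factor
        s += theta[1] * (y[t] == w[t])          # output/latent factor
        if t > 0:
            s += theta[2] * (w[t] == w[t - 1])  # latent transition factor
    return s

def log_z_clamped(theta, y, x):
    """log Z(y, x): sum over w only, with y fixed to its observed value (2.76)."""
    return math.log(sum(math.exp(score(theta, y, w, x))
                        for w in itertools.product(VALUES, repeat=len(x))))

def log_z_free(theta, x):
    """log Z(x): sum over both w and y."""
    return math.log(sum(math.exp(score(theta, y, w, x))
                        for y in itertools.product(VALUES, repeat=len(x))
                        for w in itertools.product(VALUES, repeat=len(x))))

def marginal_log_likelihood(theta, y, x):
    """log p(y | x) = log Z(y, x) - log Z(x), following (2.77)."""
    return log_z_clamped(theta, y, x) - log_z_free(theta, x)

theta = [0.8, 1.5, 0.4]
x = (1, 0, 1)
total = sum(math.exp(marginal_log_likelihood(theta, y, x))
            for y in itertools.product(VALUES, repeat=len(x)))
print(total)  # the marginals p(y | x) should sum to one over all y
```

The clamped sum ranges over $2^T$ configurations here, while the free sum ranges over $4^T$, which mirrors the remark that $Z(\mathbf{y}, \mathbf{x})$ is the easier of the two quantities to compute.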


Now that we have a way to compute $\ell$, we discuss how to maximize it with
respect to $\theta$. Maximizing $\ell(\theta)$ can be difficult because $\ell$ is no longer concave in general
(intuitively, log-sum-exp is convex, but the difference of two log-sum-exp functions
might be neither convex nor concave), so optimization procedures are typically guaranteed to find only local
maxima. Whatever optimization technique is used, the model parameters must be
carefully initialized in order to reach a good local maximum.

We discuss two different ways to maximize $\ell$: directly using the gradient, as in
Quattoni et al. [97]; and using EM, as in McCallum et al. [73]. To maximize $\ell$ directly,
we need to calculate its gradient. The simplest way to do this is to use the following
fact. For any function $f(\lambda)$, we have

$$\frac{df}{d\lambda} = f(\lambda)\, \frac{d \log f}{d\lambda}, \tag{2.78}$$

which can be seen by applying the chain rule to $\log f$ and rearranging. Applying this
to the marginal likelihood $\ell(\lambda) = \log \sum_{\mathbf{w}} p(\mathbf{y}, \mathbf{w} \mid \mathbf{x})$ yields


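$$\frac{\partial \ell}{\partial \lambda_{pk}} = \frac{1}{\sum_{\mathbf{w}} p(\mathbf{y}, \mathbf{w} \mid \mathbf{x})} \sum_{\mathbf{w}} \frac{\partial}{\partial \lambda_{pk}} \big[ p(\mathbf{y}, \mathbf{w} \mid \mathbf{x}) \big] \tag{2.79}$$

$$\phantom{\frac{\partial \ell}{\partial \lambda_{pk}}} = \sum_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{y}, \mathbf{x})\, \frac{\partial}{\partial \lambda_{pk}} \big[ \log p(\mathbf{y}, \mathbf{w} \mid \mathbf{x}) \big]. \tag{2.80}$$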

This  is  the  expectation  of  the  fully-observed  gradient,  where  the  expectation  is  taken 
over  w.  This  expression  simplifies  to 


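$$\frac{\partial \ell}{\partial \lambda_{pk}} = \sum_{\Psi_c \in C_p} \sum_{\mathbf{w}'_c} p(\mathbf{w}'_c \mid \mathbf{y}, \mathbf{x})\, f_{pk}(\mathbf{y}_c, \mathbf{x}_c, \mathbf{w}'_c) - \sum_{\Psi_c \in C_p} \sum_{\mathbf{w}'_c, \mathbf{y}'_c} p(\mathbf{w}'_c, \mathbf{y}'_c \mid \mathbf{x}_c)\, f_{pk}(\mathbf{y}'_c, \mathbf{x}_c, \mathbf{w}'_c). \tag{2.81}$$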


This gradient requires computing two different kinds of marginal probabilities.
The first term contains a marginal probability $p(\mathbf{w}'_c \mid \mathbf{y}, \mathbf{x})$, which is exactly a marginal
distribution of the clamped CRF (2.75). The second term contains a different marginal
$p(\mathbf{w}'_c, \mathbf{y}'_c \mid \mathbf{x}_c)$, which is the same marginal probability required in a fully-observed CRF.
Once we have computed the gradient, $\ell$ can be maximized by standard techniques
such as conjugate gradient. In our experience, conjugate gradient tolerates violations
of convexity better than limited-memory BFGS, so it may be a better choice for
latent-variable CRFs.
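
The structure of this gradient, a clamped expectation minus a free expectation, can be verified on a small example. The sketch below assumes the same kind of toy latent chain as before (binary latent and output variables, three hypothetical indicator features) and checks the analytic gradient against finite differences of the marginal log likelihood. Expectations are computed by enumeration over full configurations, which is equivalent to using clique marginals but feasible only at this scale.

```python
import itertools
import math

VALUES = [0, 1]  # hypothetical domain shared by latent and output variables

def feats(y, w, x):
    """Feature totals over all factors of a toy latent chain."""
    n = len(x)
    return [
        float(sum(w[t] == x[t] for t in range(n))),
        float(sum(y[t] == w[t] for t in range(n))),
        float(sum(w[t] == w[t - 1] for t in range(1, n))),
    ]

def score(theta, y, w, x):
    return sum(l * f for l, f in zip(theta, feats(y, w, x)))

def expected_feats(theta, x, y_fixed=None):
    """E[feats] under p(w | y, x) if y_fixed is given, else under p(w, y | x).

    Also returns the corresponding log normalizer.
    """
    n = len(x)
    ys = [y_fixed] if y_fixed is not None else list(itertools.product(VALUES, repeat=n))
    configs = [(y, w) for y in ys for w in itertools.product(VALUES, repeat=n)]
    weights = [math.exp(score(theta, y, w, x)) for y, w in configs]
    z = sum(weights)
    out = [0.0] * len(theta)
    for (y, w), wt in zip(configs, weights):
        for k, f in enumerate(feats(y, w, x)):
            out[k] += wt / z * f
    return out, math.log(z)

def marginal_ll(theta, y, x):
    """log p(y | x) = log Z(y, x) - log Z(x)."""
    return expected_feats(theta, x, y)[1] - expected_feats(theta, x)[1]

def gradient(theta, y, x):
    """Clamped expectation minus free expectation of each feature."""
    clamped, _ = expected_feats(theta, x, y)
    free, _ = expected_feats(theta, x)
    return [c - f for c, f in zip(clamped, free)]

theta = [0.8, 1.5, 0.4]
x, y = (1, 0, 1), (1, 1, 0)
analytic = gradient(theta, y, x)
eps = 1e-6
numeric = []
for k in range(len(theta)):
    up = list(theta)
    up[k] += eps
    down = list(theta)
    down[k] -= eps
    numeric.append((marginal_ll(up, y, x) - marginal_ll(down, y, x)) / (2 * eps))
print(analytic, numeric)
```

In a realistic model, the two calls to `expected_feats` would be replaced by two runs of an inference algorithm, one on the clamped graph and one on the full graph.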

Alternatively, $\ell$ can be optimized using expectation maximization (EM). At each
iteration $j$ in the EM algorithm, the current parameter vector $\theta^{(j)}$ is updated as
follows. First, in the E-step, an auxiliary function $q(\mathbf{w})$ is computed as $q(\mathbf{w}) =
p(\mathbf{w} \mid \mathbf{y}, \mathbf{x}; \theta^{(j)})$. Second, in the M-step, a new parameter vector $\theta^{(j+1)}$ is chosen as

$$\theta^{(j+1)} = \arg\max_{\theta'} \sum_{\mathbf{w}'} q(\mathbf{w}') \log p(\mathbf{y}, \mathbf{w}' \mid \mathbf{x}; \theta'). \tag{2.82}$$

The  direct  maximization  algorithm  and  the  EM  algorithm  are  strikingly  similar.  This 
can  be  seen  by  substituting  the  definition  of  q  into  (2.82)  and  taking  derivatives.  The 
gradient  is  almost  identical  to  the  direct  gradient  (2.81).  The  only  difference  is  that 
in  EM,  the  distribution  p(w|y,  x)  is  obtained  from  a  previous,  fixed  parameter  setting 
rather  than  from  the  argument  of  the  maximization.  We  are  unaware  of  any  empirical 
comparison of EM to direct optimization for latent-variable CRFs.
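
To make the comparison concrete, the following short derivation is a sketch in the notation of (2.80) and (2.82), where $\theta'$ denotes the argument of the M-step maximization and $\lambda'_{pk}$ one of its components. Differentiating the M-step objective while holding $q$ fixed gives

$$\frac{\partial}{\partial \lambda'_{pk}} \sum_{\mathbf{w}'} q(\mathbf{w}') \log p(\mathbf{y}, \mathbf{w}' \mid \mathbf{x}; \theta') = \sum_{\mathbf{w}'} p(\mathbf{w}' \mid \mathbf{y}, \mathbf{x}; \theta^{(j)})\, \frac{\partial}{\partial \lambda'_{pk}} \log p(\mathbf{y}, \mathbf{w}' \mid \mathbf{x}; \theta').$$

This has the same form as (2.80), except that the weighting distribution is evaluated at the fixed parameters $\theta^{(j)}$ rather than at $\theta'$. In particular, at $\theta' = \theta^{(j)}$ the two gradients coincide.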

2.4.4  Inference 

In  general  CRFs,  just  as  in  the  linear-chain  case,  gradient-based  training  requires 
computing marginal distributions $p(\mathbf{y}_c \mid \mathbf{x})$, and testing requires computing the most
likely assignment $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$. This can be accomplished using any inference
algorithm  for  graphical  models.  If  the  graph  has  small  treewidth,  then  the  junction 
tree  algorithm  can  be  used  to  exactly  compute  the  marginals,  but  because  both 
inference  problems  are  NP-hard  for  general  graphs,  this  is  not  always  possible.  In 
such  cases,  approximate  inference  must  be  used  to  compute  the  gradient. 

When  choosing  an  inference  algorithm  to  use  within  CRF  training,  the  important 
thing  to  understand  is  that  it  will  be  invoked  repeatedly,  once  for  each  time  that 
the gradient is computed. This can cause difficulty with sampling-based approaches,
such as Markov chain Monte Carlo, which may take many iterations to converge for
each parameter setting. However, contrastive divergence [47], a more computationally
efficient  method  in  which  an  MCMC  sampler  is  run  for  only  a  few  samples,  has  been 
successfully  applied  to  CRFs  in  vision  [46].  Because  of  their  computational  efficiency, 
variational  approaches  can  be  well-suited  for  CRF  training.  Several  authors  [125,  128] 
have  used  loopy  belief  propagation,  described  in  Section  2.1.4. 

2.4.5  Discussion 

This  section  contains  miscellaneous  remarks  about  CRFs.  First,  it  is  easily  seen 
that the logistic regression model (2.33) is a conditional random field with a single output
variable.  Thus,  CRFs  can  be  viewed  as  an  extension  of  logistic  regression  to  arbitrary 
graphical  structures. 

Linear-chain CRFs were originally introduced as an improvement to the maximum-entropy
Markov model (MEMM) [72], which is essentially a Markov model in which
the transition distributions are given by a logistic regression model. MEMMs can
exhibit the problems of label bias [58] and observation bias [53]. Both of these problems
can be readily understood graphically: the directed model of an MEMM implies
that for all time steps $t$, the observation $x_t$ is marginally independent of the labels
$y_{t-1}, y_{t-2}$, and so on, an independence assumption which is usually strongly violated
in sequence modeling. Sometimes this assumption can be effectively avoided
by  including  information  from  previous  time  steps  as  features,  and  this  explains  why 
MEMMs  have  had  success  in  some  NLP  applications. 

Although  we  have  emphasized  the  view  of  a  CRF  as  a  model  of  the  conditional 
distribution,  one  could  view  it  as  an  objective  function  for  parameter  estimation 
of  joint  distributions.  As  such,  it  is  one  objective  among  many,  including  generative 
likelihood,  pseudolikelihood  [9],  and  the  maximum-margin  objective  [3,  129].  Another 
related  discriminative  technique  for  structured  models  is  the  averaged  perceptron, 
which has been especially popular in the natural language community [22], in large
part  because  of  its  ease  of  implementation.  To  date,  there  has  been  little  careful 
comparison  of  these,  especially  CRFs  and  max-margin  approaches,  across  different 
structures  and  domains. 

Given  this  view,  it  is  natural  to  imagine  training  directed  models  by  conditional 
likelihood,  and  in  fact  this  is  commonly  done  in  the  speech  community,  where  it  is 
called  maximum  mutual  information  training.  However,  it  is  no  easier  to  maximize 
the  conditional  likelihood  in  a  directed  model  than  an  undirected  model,  because  in 
a  directed  model  the  conditional  likelihood  requires  computing  logp(x),  which  plays 
the  same  role  as  Z(x)  in  the  CRF  likelihood.  In  fact,  training  is  more  complex  in  a 
directed  model,  because  the  model  parameters  are  constrained  to  be  probabilities — 
constraints  which  can  make  the  optimization  problem  more  difficult.  This  is  in  stark 
contrast  to  the  joint  likelihood,  which  is  much  easier  to  compute  for  directed  models 
than undirected models (although recently several computationally efficient parameter
estimation techniques have been proposed for undirected factor graphs, such as
Abbeel  et  al.  [1]  and  Wainwright  et  al.  [142]). 

2.4.6  Implementation  Concerns 

There  are  a  few  implementation  techniques  that  can  help  both  training  time  and 
accuracy  of  CRFs,  but  are  not  always  fully  discussed  in  the  literature.  Although  these 
apply  especially  to  language  applications,  they  are  also  useful  more  generally. 

First, when the predicted variables are discrete, the features $f_{pk}$ are ordinarily
chosen to have a particular form:

$$f_{pk}(\mathbf{y}_c, \mathbf{x}_c) = \mathbf{1}_{\{\mathbf{y}_c = \tilde{\mathbf{y}}_c\}}\, q_{pk}(\mathbf{x}_c). \tag{2.83}$$

In other words, each feature is nonzero only for a single output configuration $\tilde{\mathbf{y}}_c$, but
as long as that constraint is met, then the feature value depends only on the input
observation. Essentially, this means that we can think of our features as depending
only on the input $\mathbf{x}_c$, but that we have a separate set of weights for each output
configuration. This feature representation is also computationally efficient, because
computing each $q_{pk}$ may involve nontrivial text or image processing, and it need be
evaluated only once for every feature that uses it. To avoid confusion, we refer to
the functions $q_{pk}(\mathbf{x}_c)$ as observation functions rather than as features. Examples of
observation functions are “word $x_t$ is capitalized” and “word $x_t$ ends in ing”.
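
The factored form (2.83) suggests a simple implementation pattern: evaluate each observation function once per input, then look up a separate weight for each output configuration. The sketch below illustrates this for single-label (node) factors only. The label set, the two observation functions, and the weight values are hypothetical.

```python
LABELS = ["B", "I", "O"]  # hypothetical label set

def observation_functions(word):
    """q_k(x_t): depend only on the input, so they are computed once per token."""
    return {
        "is_capitalized": 1.0 if word[:1].isupper() else 0.0,
        "ends_in_ing": 1.0 if word.endswith("ing") else 0.0,
    }

def label_scores(weights, word):
    """Score each label by summing weights[(label, name)] * q_name(word)."""
    q = observation_functions(word)  # evaluated once, reused for every label
    return {
        label: sum(weights.get((label, name), 0.0) * value
                   for name, value in q.items())
        for label in LABELS
    }

# One weight per (output configuration, observation function) pair;
# pairs that are absent from the dictionary are treated as weight zero.
weights = {
    ("B", "is_capitalized"): 2.0,
    ("O", "is_capitalized"): -1.0,
    ("O", "ends_in_ing"): 0.5,
}
scores = label_scores(weights, "Running")
print(scores)
```

Storing weights in a dictionary keyed by (label, observation) pairs is only one possible layout. It has the convenience that pairs never given a weight cost no memory, which is relevant to the discussion of unsupported features that follows.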

This representation can lead to a large number of features, which can have significant
memory and time requirements. For example, to match state-of-the-art results
on a standard natural language task, Sha and Pereira [112] use 3.8 million features.
Not all of these features are ever nonzero in the training data. In particular, some
observation functions $q_{pk}$ are nonzero only for certain output configurations. This
point can be confusing: One might think that such features can have no effect on the
likelihood, but actually they do affect $Z(\mathbf{x})$, so putting a negative weight on them can
improve the likelihood by making wrong answers less likely. In order to save memory,
however, sometimes these unsupported features, that is, those which never occur
in the training data, are removed from the model. In practice, however, including
unsupported features typically results in better accuracy.

In  order  to  get  the  benefits  of  unsupported  features  with  less  memory,  we  have  had 
success  with  an  ad  hoc  technique  for  selecting  a  small  set  of  unsupported  features. 
The  idea  is  to  add  unsupported  features  only  for  likely  paths,  as  follows:  first  train 
a  CRF  without  any  unsupported  features,  stopping  after  a  few  iterations;  then  add 
unsupported features $f_{pk}(\mathbf{y}_c, \mathbf{x}_c)$ for cases where $\mathbf{x}_c$ occurs in the training data for
some instance $\mathbf{x}^{(i)}$, and $p(\mathbf{y}_c \mid \mathbf{x}^{(i)}) > \epsilon$. McCallum [68] presents a more principled
method  of  feature  selection  for  CRFs. 

Second,  if  the  observations  are  categorical  rather  than  ordinal,  that  is,  if  they 
are  discrete  but  have  no  intrinsic  order,  it  is  important  to  convert  them  to  binary 
features. For example, it makes sense to learn a linear weight on $f_k(y, x_t)$ when $f_k$ is
1 if $x_t$ is the word dog and 0 otherwise, but not when $f_k$ is the integer index of word
$x_t$ in the text’s vocabulary. Thus, in text applications, CRF features are typically
binary; in other application areas, such as vision and speech, they are more commonly
real-valued.
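
As a small illustration of this conversion, the sketch below replaces a categorical word identity with one binary indicator per vocabulary entry. The three-word vocabulary and the function name are hypothetical; practical implementations usually store only the index of the nonzero entry rather than the full vector.

```python
def one_hot_word_features(word, vocabulary):
    """Return one binary indicator per vocabulary entry instead of an integer index."""
    return [1.0 if word == v else 0.0 for v in vocabulary]

vocabulary = ["cat", "dog", "fish"]
encoded = one_hot_word_features("dog", vocabulary)
print(encoded)
```

With this encoding, each word receives its own weight, whereas a single weight on the integer index would force the score to vary linearly with an arbitrary ordering of the vocabulary.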

Third, in language applications, it is sometimes helpful to include redundant factors
in the model. For example, in a linear-chain CRF, one may choose to include
both edge factors $\Psi_t(y_t, y_{t-1}, \mathbf{x}_t)$ and variable factors $\Psi_t(y_t, \mathbf{x}_t)$. Although one could
define the same family of distributions using only edge factors, the redundant node
factors provide a kind of backoff, which is useful when there is too little data. In language
applications, there is always too little data, even when hundreds of thousands
of  words  are  available.  It  is  important  to  use  regularization  when  using  redundant 
features  like  this,  because  it  is  the  penalty  on  large  weights  that  encourages  the  weight 
to  be  spread  across  the  overlapping  features. 

Fourth, sometimes it is preferable to use $L_1$ regularization instead of $L_2$, particularly
if it is desired that the trained weights be sparse; it also has certain theoretical
advantages [87]. The $L_1$ regularizer is not differentiable at 0, which complicates numerical
parameter estimation somewhat [4, 43].

Finally, often the probabilities involved in forward-backward and belief propagation
become too small to be represented within numerical precision. There are two
standard approaches to this common problem. One approach is to normalize each of
the vectors $\alpha_t$ and $\beta_t$ to sum to 1, thereby magnifying small values. This scaling does
not  affect  our  ability  to  compute  Z(x).  The  details  of  how  to  do  this  are  given  by 
Rabiner  [98]. 
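
The following is a minimal sketch of the scaling approach for the forward pass only, assuming a chain whose factors are given as an initial vector `init` and a list of transition matrices `trans` (both hypothetical names and values). Each forward vector is rescaled to sum to 1, and the logarithms of the discarded scale factors are accumulated; their sum is $\log Z(\mathbf{x})$. A brute-force enumeration is included only as a check.

```python
import itertools
import math

def normalize(vec):
    """Rescale vec to sum to 1 and return the log of the scale that was removed."""
    total = sum(vec)
    return [v / total for v in vec], math.log(total)

def log_z_scaled(init, trans):
    """Forward recursion with every alpha_t rescaled to sum to 1; returns log Z(x)."""
    alpha, log_z = normalize(init)
    for step in trans:
        # step[i][j] is the factor value for moving from state i to state j
        alpha = [sum(alpha[i] * step[i][j] for i in range(len(alpha)))
                 for j in range(len(step[0]))]
        alpha, log_scale = normalize(alpha)
        log_z += log_scale  # the removed scale factors multiply to Z(x)
    return log_z

def log_z_brute_force(init, trans):
    """Reference value by enumerating all state sequences (tiny chains only)."""
    n = len(init)
    total = 0.0
    for y in itertools.product(range(n), repeat=len(trans) + 1):
        p = init[y[0]]
        for t, step in enumerate(trans):
            p *= step[y[t]][y[t + 1]]
        total += p
    return math.log(total)

init = [1e-3, 2e-3]
trans = [[[1e-3, 4e-3], [2e-3, 1e-3]]] * 5
print(log_z_scaled(init, trans), log_z_brute_force(init, trans))
```

On a much longer chain with the same factors, the unscaled products would underflow to zero, while the scaled recursion continues to return a finite value of $\log Z(\mathbf{x})$.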

A second approach is to perform computations in the logarithmic domain, e.g.,
the forward recursion becomes

$$\log \alpha_t(j) = \bigoplus_{i} \big( \log \Psi_t(j, i, \mathbf{x}_t) + \log \alpha_{t-1}(i) \big), \tag{2.84}$$

where $\oplus$ is the operator $a \oplus b = \log(e^a + e^b)$. At first, this does not seem much of an
improvement, since numerical precision is lost when computing $e^a$ and $e^b$. But $\oplus$ can
be computed as

$$a \oplus b = a + \log(1 + e^{b-a}) = b + \log(1 + e^{a-b}), \tag{2.85}$$

which can be much more numerically stable, particularly if we pick the version of the
identity with the smaller exponent. CRF implementations often use the log-space approach
because it makes computing $Z(\mathbf{x})$ more convenient, but in some applications,
the computational expense of taking logarithms is an issue, making normalization
preferable.
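
A minimal sketch of the log-domain approach follows. The function `log_add` implements the operator of (2.85), always choosing the form whose exponent is nonpositive, and `log_forward` applies it in the recursion (2.84). The function names and the representation of factors as nested lists of log values are assumptions made for this example.

```python
import math

def log_add(a, b):
    """Compute log(e^a + e^b) using the form of (2.85) whose exponent is <= 0."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))

def log_forward(log_init, log_trans):
    """Forward recursion (2.84) in the log domain; returns log Z(x)."""
    log_alpha = list(log_init)
    for step in log_trans:
        new = []
        for j in range(len(step[0])):
            acc = -math.inf  # log of zero: the identity element for log_add
            for i in range(len(log_alpha)):
                # step[i][j] is the log factor for moving from state i to state j
                acc = log_add(acc, step[i][j] + log_alpha[i])
            new.append(acc)
        log_alpha = new
    total = -math.inf
    for v in log_alpha:
        total = log_add(total, v)
    return total

# Adding two very small probabilities: exp(-1000) underflows to zero in double
# precision, but the log-domain sum is still computed accurately.
tiny = log_add(-1000.0, -1000.0)
print(tiny)
```

Using `math.log1p` rather than `math.log(1 + ...)` preserves accuracy when the exponential term is very small.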

Notes  on  Terminology 

Different parts of the theory of graphical models have been developed independently
in many different areas, so many of the concepts in this chapter have different
names in different areas. For example, undirected models are commonly also referred
to as Markov random fields, Markov networks, and Gibbs distributions. As mentioned,
I reserve the term “graphical model” for a family of distributions defined by a graph
structure; “random field” or “distribution” for a single probability distribution; and
“network”  as  a  term  for  the  graph  structure  itself.  This  choice  of  terminology  is  not 
always  consistent  in  the  literature,  partly  because  it  is  not  ordinarily  necessary  to  be 
precise  in  separating  these  concepts. 

Similarly,  directed  graphical  models  are  commonly  known  as  Bayesian  networks, 
but  I  have  avoided  this  term  because  of  its  confusion  with  the  area  of  Bayesian 
statistics.  The  term  generative  model  is  an  important  one  that  is  commonly  used  in 
the  literature,  but  is  not  usually  given  a  precise  definition. 

CHAPTER 3


MODELS 


Although  many  sequence  tagging  and  information  extraction  tasks  have  proven 
amenable  to  linear-chain  CRFs,  higher-order  dependencies  do  exist  in  these  problems. 
In  this  chapter,  I  introduce  two  families  of  loopy  conditional  random  fields  that  model 
limited  forms  of  long-distance  structure,  and  achieve  better  accuracy  as  a  result.  Not 
only  are  these  models  of  practical  interest,  but  also  they  will  prove  useful  for  testing 
the  approximate  training  algorithms  that  I  present  in  the  remainder  of  the  thesis. 

Both  of  the  model  families  that  I  discuss  are  constructed  as  augmentations  of 
linear-chain CRFs. First, I introduce the dynamic CRF (Section 3.1), which augments
a linear-chain CRF with factorized state. That is, the state at each sequence
position is represented as a vector rather than as a single variable, which allows the
transition distribution to be represented more efficiently. Second, I introduce the
skip-chain CRF (Section 3.2), which captures the idea that similar tokens should receive
similar labels, even if they are far apart in the sequence. This smoothing effect
can be achieved by adding long-distance factors that depend on the similar token
pairs. For both DCRFs and skip-chain CRFs, I explore the use of approximate inference
algorithms, particularly loopy belief propagation, during training. Approximate
inference  can  greatly  improve  training  speed  while  maintaining  accuracy;  however,  if 
inference  is  too  inaccurate,  then  the  quality  of  the  solution  can  degrade  markedly. 

3.1 Dynamic CRFs

Many  sequence  labeling  problems  have  a  natural  notion  of  factorized  state.  In  a 
factorized  state  representation,  rather  than  representing  the  state  as  a  single  random 
variable $y_t$ for each sequence position $t$, the state is represented as a vector $\mathbf{y}_t =
\{y_{t1}, y_{t2}, \ldots, y_{tm}\}$, allowing the model to be more compact because it can represent
the graphical structure among components of $\mathbf{y}$. In generative sequence models, this
factorization  is  typically  represented  by  a  dynamic  Bayesian  network  (DBN)  [29,  84], 
which  is  a  directed  graphical  model  whose  structure  is  repeated  across  a  sequence. 
DBNs have been used for applications as diverse as robot navigation [133], audio-visual
speech recognition [86], activity recognition [13], information extraction [93,
114], and automatic speech recognition [10]. DBNs are typically trained to maximize
the joint probability distribution $p(\mathbf{y}, \mathbf{x})$ of a set of observation sequences $\mathbf{x}$ and labels
y.  As  discussed  in  Section  2.2.3,  however,  when  the  task  does  not  require  the  ability 
to  generate  x,  such  as  in  segmentation  and  labeling,  modeling  the  joint  distribution 
is  a  waste  of  modeling  effort. 

A solution to this problem is to model instead the conditional probability distribution
$p(\mathbf{y} \mid \mathbf{x})$, as in a conditional random field. For this reason, we introduce dynamic
CRFs (DCRFs), which are a generalization of linear-chain CRFs that repeat structure
and parameters over a sequence of state vectors. This allows us both to represent
distributed hidden state and complex interaction among labels, as in DBNs, and to
use rich, overlapping feature sets, as in conditional models. For example, the factorial
structure in Figure 3.1(b) includes links between cotemporal labels, explicitly
modeling limited probabilistic dependencies between two different label sequences.
Other types of DCRFs can model higher-order Markov dependence between labels
(Figure 3.2), or incorporate a fixed-size memory. For example, a DCRF for part-of-speech
tagging could include for each word a hidden state that is true if any previous
word has been tagged as a verb.

Figure  3.1.  Graphical  representation  of  (a)  linear-chain  CRF,  and  (b)  factorial  CRF. 
Although  the  hidden  nodes  can  depend  on  observations  at  any  time  step,  for  clarity 
we  have  shown  links  only  to  observations  at  the  same  time  step. 

Any  DCRF  with  multiple  state  variables  can  be  collapsed  into  a  linear-chain  CRF 
whose state space is the cross-product of the outcomes of the original state variables.
However, such a linear-chain CRF needs exponentially many parameters in
the  number  of  variables.  Like  DBNs,  DCRFs  represent  the  joint  distribution  with 
fewer  parameters  by  exploiting  conditional  independence  relations. 
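
A rough count illustrates the difference. The sketch below assumes $m$ state variables per time step, each with $k$ possible values, and counts only label-to-label weights, ignoring any dependence on the observations. For the factored model it assumes one particular structure, namely a transition factor within each chain and a cotemporal factor between each pair of adjacent chains; other DCRF structures would give different counts.

```python
def collapsed_transition_weights(num_vars, num_values):
    """Transition table of a linear chain over the cross-product state space."""
    states = num_values ** num_vars
    return states * states

def factorial_transition_weights(num_vars, num_values):
    """Pairwise tables under the assumed factorial structure."""
    within_chain = num_vars * num_values ** 2       # one transition table per chain
    cotemporal = (num_vars - 1) * num_values ** 2   # one table per adjacent chain pair
    return within_chain + cotemporal

collapsed = collapsed_transition_weights(3, 10)
factored = factorial_transition_weights(3, 10)
print(collapsed, factored)
```

With three variables of ten values each, the collapsed chain has $1000^2$ transition weights, while the assumed factorial structure has a few hundred.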

In  natural-language  processing,  DCRFs  are  especially  attractive  because  they  are 
a probabilistic generalization of cascaded, weighted finite-state transducers [82]. In
general, many sequence-processing problems are traditionally solved by chaining errorful
subtasks, such as chains of finite state transducers. In such an approach,
however,  errors  early  in  processing  nearly  always  cascade  through  the  chain,  causing 
errors  in  the  final  output.  This  problem  can  be  solved  by  jointly  representing  the 
subtasks  in  a  single  graphical  model,  both  explicitly  representing  their  dependence, 
and  preserving  uncertainty  between  them.  DCRFs  can  represent  dependence  between 
subtasks  solved  using  finite-state  transducers,  such  as  phonological  and  morphological 
analysis,  POS  tagging,  shallow  parsing,  and  information  extraction. 

More specifically, in information extraction and data mining, McCallum and Jensen [69] argue
that the same kind of probabilistic unification can potentially be useful, because in
many cases, we wish to mine a database that has been extracted from raw text. A
unified probabilistic model for extraction and mining can allow data mining to take
into account the uncertainty in the extraction, and allow extraction to benefit from
emerging patterns produced by data mining. The applications here, in which DCRFs
are used to jointly perform multiple sequence labeling tasks, can be viewed as an
initial step toward that goal.

Figure 3.2. Examples of DCRFs (panels: factorial; second-order Markov). The dashed
lines indicate the boundary between time steps. The input variables x are not shown.

In  this  chapter,  we  evaluate  DCRFs  on  several  natural-language  processing  tasks. 
First,  a  factorial  CRF  that  learns  to  jointly  predict  parts  of  speech  and  segment 
noun  phrases  performs  better  than  cascaded  models  that  perform  the  two  tasks  in 
sequence.  Also,  we  compare  several  schedules  for  belief  propagation,  showing  that 
although  exact  inference  is  feasible,  on  this  task  approximate  inference  has  lower  total 
training  time  with  no  loss  in  testing  accuracy. 

In  addition  to  conditional  maximum  likelihood  training,  we  present  an  alternative 
training  method  for  DCRFs,  cascaded  training.  Cascaded  training  is  intended  for 
situations  in  which  a  single  fully-labeled  data  set  is  not  available,  and  instead  the 
outputs are partitioned into sets $(\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_\ell)$, and we have one data set $D_0$ labeled
for $\mathbf{y}_0$, another data set $D_1$ labeled for $\mathbf{y}_1$, and so on. For example, this can be
the case in transfer learning, in which we wish to use previous learning problems
(that is, $\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{\ell-1}$) to improve performance on a new task $\mathbf{y}_\ell$. To handle the
fact that a single fully-labeled training set is unavailable, our procedure works in a
cascaded fashion, in which first we train a CRF $p_0$ to predict $\mathbf{y}_0$ on $D_0$, then we
annotate $D_1$ with the most likely prediction from $p_0$, then we train a CRF $p_1$ for
$p(\mathbf{y}_1 \mid \mathbf{y}_0, \mathbf{x})$, and so on. Compared to other work in transfer learning, an interesting
aspect  of  this  approach  is  that  the  model  includes  no  shared  latent  structure  between 
subtasks;  rather,  the  probabilistic  dependence  between  tasks  is  modeled  directly.  On 
a  benchmark  information  extraction  task,  we  show  that  a  DCRF  trained  in  a  cascaded 
fashion  performs  better  than  a  linear-chain  CRF  on  the  original  task. 

In the rest of this section, we first define DCRFs (Section 3.1.1), explaining methods
for approximate inference and parameter estimation, including parameter estimation
using BP (Section 3.1.4) and cascaded parameter estimation (Section 3.1.5). In
Section 3.1.6, we present the experimental results, including evaluation of FCRFs on
noun-phrase chunking (Section 3.1.6.1), comparison of BP schedules in FCRFs (Section
3.1.6.2), and cascaded training of DCRFs for transfer learning (Section 3.1.6.3).

3.1.1  Model  Representation 

A  dynamic  CRF  (DCRF)  is  a  conditional  distribution  that  factorizes  according 
to  an  undirected  graphical  model  whose  structure  and  parameters  are  repeated  over 
a  sequence.  As  with  a  DBN,  a  DCRF  can  be  specified  by  a  template  that  gives 
the  graphical  structure,  features,  and  weights  for  two  time  slices,  which  can  then 
be  unrolled  given  an  input  x.  The  same  set  of  features  and  weights  is  used  at  each 
sequence  position,  so  that  the  parameters  are  tied  across  the  network.  Several  example 
templates  are  given  in  Figure  3.2. 

Now we give a formal description of the unrolling process. Let $y = \{y_1 \ldots y_T\}$ be
a sequence of random vectors $y_i = (y_{i1} \ldots y_{im})$, where $y_i$ is the state vector at time $i$,
and $y_{ij}$ is the value of variable $j$ at time $i$. To give the likelihood equation for arbitrary
DCRFs, we require a way to describe a clique in the unrolled graph independent of
its position in the sequence. For this purpose we introduce the concept of a clique
index. Given a time $t$, we can denote any variable $y_{ij}$ in $y$ by two integers: its
index $j$ in the state vector $y_i$, and its time offset $\Delta t = i - t$. We will call a set
$c = \{(\Delta t, j)\}$ of such pairs a clique index, which denotes a set of variables $y_{t,c}$ by
$y_{t,c} \equiv \{y_{t+\Delta t, j} \mid (\Delta t, j) \in c\}$. That is, $y_{t,c}$ is the set of variables in the unrolled version
of clique index $c$ at time $t$.
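
As a rough illustration of how a clique index unrolls, the following Python sketch enumerates the variables $y_{t,c}$ for every time step at which the clique fits inside the sequence. The function name `unroll_clique_index`, the 1-based time convention, and the choice to skip positions where the clique would reach past the sequence boundary are assumptions made for this example; they are not part of the definition above.

```python
from typing import FrozenSet, List, Tuple

# A clique index is a set of (time offset, variable index) pairs.
CliqueIndex = FrozenSet[Tuple[int, int]]


def unroll_clique_index(c: CliqueIndex, T: int) -> List[Tuple[int, List[Tuple[int, int]]]]:
    """Return (t, y_{t,c}) for each t in 1..T where every variable of the clique exists.

    Each variable is identified by the pair (time i, index j), with i = t + offset.
    """
    offsets = [dt for dt, _ in c]
    unrolled = []
    for t in range(1, T + 1):
        # Skip positions where the clique would reach outside the sequence.
        if t + min(offsets) < 1 or t + max(offsets) > T:
            continue
        unrolled.append((t, sorted((t + dt, j) for dt, j in c)))
    return unrolled


# A within-chain edge links variable 0 at times t and t+1;
# a cotemporal edge links variables 0 and 1 at the same time.
chain_edge = frozenset({(0, 0), (1, 0)})
cotemporal_edge = frozenset({(0, 0), (0, 1)})
```

For a sequence of length 3, `unroll_clique_index(chain_edge, 3)` yields two cliques (at $t = 1$ and $t = 2$), while the cotemporal edge yields one clique per time step.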

Now  we  can  formally  define  DCRFs: 

Definition 3.1. Let $C$ be a set of clique indices, $F = \{f_k(y_{t,c}, x, t)\}$ be a set of feature
functions and $\Lambda = \{\lambda_k\}$ be a set of real-valued weights. Then the distribution $p$ is a
dynamic conditional random field if and only if

\[ p(y | x) = \frac{1}{Z(x)} \prod_t \prod_{c \in C} \exp\Big( \sum_k \lambda_k f_k(y_{t,c}, x, t) \Big), \tag{3.1} \]

where $Z(x) = \sum_y \prod_t \prod_{c \in C} \exp\big( \sum_k \lambda_k f_k(y_{t,c}, x, t) \big)$ is the partition function.

Although we define a DCRF as having the same set of features for all the cliques,
in practice we choose feature functions $f_k$ so that they are zero except on cliques
with some index $c_k$. Thus, we will sometimes think of each clique index as having
its own set of features and weights, and speak of $f_k$ and $\lambda_k$ as having an associated
clique index $c_k$.

DCRFs  generalize  not  only  linear-chain  CRFs,  but  more  complicated  structures 
as  well.  For  example,  in  this  chapter,  we  use  a  factorial  CRF  (FCRF),  which  has 
linear  chains  of  labels,  with  connections  between  cotemporal  labels.  We  name  these 
after  factorial  HMMs  [39].  Figure  3.1(b)  shows  an  unrolled  factorial  CRF.  Consider 
an FCRF with $L$ chains, where $Y_{\ell,t}$ is the variable in chain $\ell$ at time $t$. The clique
indices for this DCRF are of the form $\{(0, \ell), (1, \ell)\}$ for each of the within-chain edges
and $\{(0, \ell), (0, \ell + 1)\}$ for each of the between-chain edges. The FCRF $G$ defines a
distribution  over  hidden  states  as: 


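\[ p(y | x) = \frac{1}{Z(x)} \left( \prod_{t=1}^{T-1} \prod_{\ell=1}^{L} \Phi_\ell(y_{\ell,t}, y_{\ell,t+1}, x, t) \right) \left( \prod_{t=1}^{T} \prod_{\ell=1}^{L-1} \Psi_\ell(y_{\ell,t}, y_{\ell+1,t}, x, t) \right), \tag{3.2} \]

where $\{\Phi_\ell\}$ are the potentials over the within-chain edges, $\{\Psi_\ell\}$ are the potentials
over the between-chain edges, and $Z(x)$ is the partition function. The potentials
factorize according to the features $\{f_k\}$ and weights $\{\lambda_k\}$ of $G$ as:

\[ \Phi_\ell(y_{\ell,t}, y_{\ell,t+1}, x, t) = \exp\Big\{ \sum_k \lambda_k f_k(y_{\ell,t}, y_{\ell,t+1}, x, t) \Big\} \]

\[ \Psi_\ell(y_{\ell,t}, y_{\ell+1,t}, x, t) = \exp\Big\{ \sum_k \lambda_k f_k(y_{\ell,t}, y_{\ell+1,t}, x, t) \Big\} \]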
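
To make the factorization in (3.2) concrete, here is a minimal brute-force sketch in Python. It assumes the input $x$ is fixed and that the potentials have already been evaluated into plain lookup tables that do not vary with $t$; both simplifications, along with the names `fcrf_score` and `fcrf_distribution`, are assumptions of this example. Enumerating every labeling is only feasible for toy models and is not how inference is done in practice.

```python
import itertools


def fcrf_score(y, phi, psi):
    """Unnormalized score of a labeling y, where y[l][t] is the label of chain l at time t."""
    L, T = len(y), len(y[0])
    score = 1.0
    for t in range(T - 1):          # within-chain edges
        for l in range(L):
            score *= phi[l][y[l][t]][y[l][t + 1]]
    for t in range(T):              # between-chain edges
        for l in range(L - 1):
            score *= psi[l][y[l][t]][y[l + 1][t]]
    return score


def fcrf_distribution(num_labels, T, phi, psi):
    """Brute-force p(y | x): enumerate every joint labeling and normalize by Z(x)."""
    per_chain = [list(itertools.product(range(n), repeat=T)) for n in num_labels]
    scores = {y: fcrf_score(y, phi, psi) for y in itertools.product(*per_chain)}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}


# Two chains with 2 and 3 labels, sequence length 3 (illustrative numbers only).
phi = [[[2.0, 1.0], [1.0, 2.0]],                             # chain 0 transitions
       [[1.0, 1.0, 3.0], [1.0, 2.0, 1.0], [3.0, 1.0, 1.0]]]  # chain 1 transitions
psi = [[[1.0, 2.0, 1.0], [4.0, 1.0, 1.0]]]                   # chain 0 label vs chain 1 label
p = fcrf_distribution([2, 3], 3, phi, psi)
```

The resulting dictionary `p` maps each of the $2^3 \cdot 3^3$ joint labelings to its probability.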

More  complicated  structures  are  also  possible,  such  as  second-order  CRFs,  and 
hierarchical CRFs, which are moralized versions of the hierarchical HMMs of Fine
et al. [31].1 As in DBNs, this factorized structure can use many fewer parameters
than  the  cross-product  state  space:  even  the  two-level  FCRF  we  discuss  below  uses 
less  than  an  eighth  of  the  parameters  of  the  corresponding  cross-product  CRF. 
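
One way to arrive at a figure of this kind is sketched below. It assumes the 45 part-of-speech labels and 3 noun-phrase labels of the chunking experiments reported later in this chapter, and it counts one weight per label configuration of each pairwise clique for a single input feature. This counting scheme is an assumption of the example; the text does not spell out how the figure was computed.

```python
def pairwise_configurations(n_pos: int, n_np: int):
    """Count label configurations on pairwise cliques, per input feature.

    cross:     transitions in a linear chain over the cross-product label space
    factorial: two within-chain transition tables plus one between-chain table
    """
    cross = (n_pos * n_np) ** 2
    factorial = n_pos ** 2 + n_np ** 2 + n_pos * n_np
    return cross, factorial


cross, factorial = pairwise_configurations(45, 3)
ratio = factorial / cross
```

Under these assumptions the cross-product chain has 18,225 configurations against 2,169 for the factorial structure, a ratio of roughly 0.119.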

3.1.2  Inference  in  DCRFs 

Inference  in  a  DCRF  can  be  done  using  any  inference  algorithm  for  undirected 
models. For an unlabeled sequence $x$, we typically wish to solve two inference
problems: (a) computing the marginals $p(y_{t,c} | x)$ over all cliques $y_{t,c}$, and (b) computing
the Viterbi decoding $y^* = \arg\max_y p(y | x)$. The Viterbi decoding can be used to
label  a  new  sequence,  and  marginal  computation  is  used  for  parameter  estimation. 

If  the  number  of  states  is  not  large,  the  simplest  approach  is  to  form  a  linear 
chain  whose  output  space  is  the  cross-product  of  the  original  DCRF  outputs,  and 
then  perform  forward-backward.  In  other  words,  a  DCRF  can  always  be  viewed  as 
a  linear-chain  CRF  whose  feature  functions  take  a  special  form,  analogous  to  the 
relationship  between  generative  DBNs  and  HMMs.  The  cross-product  space  is  often 
very large, however, in which case this approach is infeasible. Alternatively, one
can perform exact inference by applying the junction tree algorithm to the unrolled
DCRF, or by using the special-purpose inference algorithms that have been designed
for DBNs [84], which can avoid storing the full unrolled graph.

1Hierarchical HMMs were shown to be DBNs by Murphy and Paskin [83].

In complex DCRFs, though, exact inference can still be expensive, so that approximate
methods are necessary. Furthermore, because marginal computation is needed
during training, inference must be efficient so that we can use large training sets even
if there are many labels. The largest experiment reported here required computing
pairwise marginals in 866,792 different graphical models: one for each training
example in each iteration of a convex optimization algorithm.

We  focus  on  approximate  inference  using  loopy  belief  propagation,  which  was 
described  in  Section  2.1.4.  In  the  experiments  here,  we  pay  special  attention  to 
the  order  in  which  messages  are  propagated.  At  each  iteration  of  belief  propagation, 
messages  can  be  sent  in  any  order,  and  choosing  a  good  schedule  can  affect  how  quickly 
the  algorithm  converges.  We  describe  two  schedules  for  belief  propagation:  tree-based 
and  random.  The  tree-based  schedule,  also  known  as  tree  reparameterization  (TRP) 
[137,  139],  propagates  messages  along  a  set  of  cross-cutting  spanning  trees  of  the 
original graph. At each iteration $i$ of TRP, a spanning tree $T^{(i)} \in \Upsilon$ is selected from
this set $\Upsilon$, and messages are sent in both directions along every edge in $T^{(i)}$, which
amounts to exact inference on $T^{(i)}$. Many possible sets of spanning trees can be imagined, but here we
select  trees  randomly,  except  that  edges  that  have  never  been  used  in  any  previous 
iteration  are  selected  first. 
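
The tree-selection rule just described could be sketched as follows. This is only one plausible realization, assuming a Kruskal-style construction with a union-find structure over a shuffled edge list; the function name `pick_spanning_tree` and the `used` bookkeeping set are hypothetical, and the text does not say how the random trees were actually drawn.

```python
import random


def pick_spanning_tree(nodes, edges, used, rng):
    """Pick a random spanning tree, trying edges not yet in `used` first."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    order = list(edges)
    rng.shuffle(order)
    order.sort(key=lambda e: e in used)   # stable sort: unused edges (False) come first
    tree = []
    for u, v in order:
        ru, rv = find(u), find(v)
        if ru != rv:                      # adding (u, v) does not close a cycle
            parent[ru] = rv
            tree.append((u, v))
    used.update(tree)
    return tree


# Two coupled chains of length 3: node (l, t) is chain l at time t.
nodes = [(l, t) for l in range(2) for t in range(3)]
edges = [((l, t), (l, t + 1)) for l in range(2) for t in range(2)]
edges += [((0, t), (1, t)) for t in range(3)]
used = set()
rng = random.Random(0)
first = pick_spanning_tree(nodes, edges, used, rng)
second = pick_spanning_tree(nodes, edges, used, rng)
```

On this small graph each tree has five edges, and because the two edges left out of the first tree are tried first in the second, every edge has been used after two iterations.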

The random schedule simply sends messages across all edges in random order. To
improve convergence, we arbitrarily order each edge $e_i = (s_i, t_i)$ and send all messages
$m_{s_i}(t_i)$ before any messages $m_{t_i}(s_i)$. Note that for a graph with $V$ nodes and $E$ edges,
TRP sends $O(V)$ messages per BP iteration, while the random schedule sends $O(E)$
messages.

An alternative schedule is a synchronous schedule, in which conceptually all messages
are sent at the same time. In the tree-based and random schedules, once a
message is updated, its new values are immediately available for other messages. In
the synchronous schedule, on the other hand, when computing a message $m_u^{(j)}(x_v)$ at
iteration $j$ of BP, the previous message values $m_t^{(j-1)}(x_u)$ are always used, even if an
updated value $m_t^{(j)}(x_u)$ has been computed. We do not report results from the synchronous
schedule because preliminary experiments indicated that it requires many
more iterations to converge than the other schedules.

3.1.3  Parameter  Estimation  in  DCRFs 

Parameter  estimation  of  DCRFs  by  conditional  maximum  likelihood  follows  the 
general  method  explained  in  Section  2.4.3.  Written  in  the  notation  of  this  chapter, 
the  likelihood  is 

\[ \ell(\Lambda) = \sum_i \log p_\Lambda(y^{(i)} | x^{(i)}). \tag{3.3} \]

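The derivative of this with respect to a parameter $\lambda_k$ associated with clique index $c$
is

\[ \frac{\partial \ell}{\partial \lambda_k} = \sum_i \sum_t f_k(y^{(i)}_{t,c}, x^{(i)}, t) - \sum_i \sum_t \sum_{y_{t,c}} p_\Lambda(y_{t,c} | x^{(i)}) f_k(y_{t,c}, x^{(i)}, t), \tag{3.4} \]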


where $y^{(i)}_{t,c}$ is the assignment to $y_{t,c}$ in $y^{(i)}$, and $y_{t,c}$ ranges over assignments to the
clique $c$. Observe that it is the factor $p_\Lambda(y_{t,c} | x^{(i)})$ that requires us to compute
marginal probabilities in the unrolled DCRF. As before, to reduce overfitting, we
define a spherical Gaussian prior $p(\Lambda)$ over parameters, with mean $\mu = 0$ and covariance
matrix $\Sigma = \sigma^2 I$, so that the gradient becomes

\[ \frac{\partial \log p(\Lambda | D)}{\partial \lambda_k} = \frac{\partial \ell}{\partial \lambda_k} - \frac{\lambda_k}{\sigma^2}. \]

In the experiments here, we optimize this objective using batch limited-memory BFGS.
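
The structure of this gradient, observed feature counts minus expected feature counts minus the prior term, can be written out as a short Python sketch. The data layout (a list of sequences, each a list of observed assignments paired with a dictionary of clique marginals) and the name `penalized_gradient` are assumptions made for illustration; the marginals are taken as given rather than computed.

```python
def penalized_gradient(instances, f_k, lam_k, sigma2):
    """Gradient of the penalized log-likelihood with respect to one weight.

    instances: list of sequences; each sequence is a list over positions t of
               (observed clique assignment, {assignment: marginal probability}).
    f_k:       feature function f_k(assignment, t); dependence on the input x
               is assumed to be closed over by the caller.
    """
    grad = 0.0
    for sequence in instances:
        for t, (observed, marginal) in enumerate(sequence, start=1):
            grad += f_k(observed, t)                                  # empirical count
            grad -= sum(p * f_k(y, t) for y, p in marginal.items())   # expected count
    return grad - lam_k / sigma2                                      # Gaussian prior term
```

In a full implementation this quantity would be assembled for every weight and handed, together with the objective value, to the optimizer.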

3.1.4  Approximate  Parameter  Estimation  Using  BP 

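Several additional issues arise when loopy BP is used during training. First, to simplify
notation in this section, we will write a DCRF as $p(y | x) = Z(x)^{-1} \prod_t \prod_c \psi_{t,c}(y_{t,c})$,
where each factor in the unrolled DCRF is defined as

\[ \psi_{t,c}(y_{t,c}) = \exp\Big\{ \sum_k \lambda_k f_k(y_{t,c}, x, t) \Big\}. \tag{3.5} \]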


The  basic  procedure  is  to  optimize  the  likelihood  (3.3)  as  described  in  the  last 
section,  but  instead  of  running  an  exact  inference  algorithm  on  each  training  example 
to obtain marginal distributions $p_\Lambda(y_{t,c} | x^{(i)})$, we run BP on each training instance to
obtain approximate beliefs $b_{t,c}(y_{t,c})$ for each clique $y_{t,c}$ and approximate node beliefs
$b_s(y_s)$ for each output variable $s$.

Now,  although  BP  provides  approximate  marginal  distributions  that  allow  cal¬ 
culating  the  gradient,  there  is  still  the  issue  of  how  to  calculate  an  approximate 
likelihood.  In  particular,  we  need  an  approximate  objective  function  whose  gradient 
is equal to the approximate gradient we have just described. We use the approximate
likelihood

\[ \hat{\ell}(\Lambda; \{b\}) = \sum_i \log \left[ \frac{\prod_t \prod_c b_{t,c}(y^{(i)}_{t,c})}{\prod_s b_s(y^{(i)}_s)^{d_s - 1}} \right], \tag{3.6} \]

where $s$ ranges over output variables (that is, components of $y$), and $d_s$ is the degree
of $s$ (that is, the number of factors $\psi_{t,c}$ that depend on the variable $s$). In other words,
we approximate the joint likelihood by the product over each clique's approximate
belief, dividing by the node beliefs to avoid overcounting. In the remainder of this
section, we justify this choice.
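
A quick sanity check of the clique-over-node form is that it recovers the exact log-probability when the graph is a tree and the beliefs are exact marginals. The Python sketch below verifies this on a three-variable chain. The potentials are arbitrary illustrative numbers, and exact marginals computed by enumeration stand in for the beliefs that BP would return; none of this is taken from the experiments.

```python
import itertools
import math


def approx_log_likelihood(y, clique_beliefs, node_beliefs, degree):
    """log [ prod_c b_c(y_c) / prod_s b_s(y_s)^(d_s - 1) ] for one labeled instance."""
    total = 0.0
    for scope, belief in clique_beliefs.items():
        total += math.log(belief[tuple(y[s] for s in scope)])
    for s, belief in node_beliefs.items():
        total -= (degree[s] - 1) * math.log(belief[y[s]])
    return total


# Three binary variables in a chain 0 - 1 - 2 with two pairwise factors.
psi = {(0, 1): [[2.0, 1.0], [1.0, 3.0]], (1, 2): [[1.0, 4.0], [2.0, 1.0]]}
states = list(itertools.product(range(2), repeat=3))
score = {y: math.prod(tab[y[a]][y[b]] for (a, b), tab in psi.items()) for y in states}
Z = sum(score.values())
joint = {y: v / Z for y, v in score.items()}

# Exact marginals stand in for the beliefs that BP would return on this tree.
clique_beliefs = {scope: {} for scope in psi}
node_beliefs = {s: {} for s in range(3)}
for y, p in joint.items():
    for scope in psi:
        key = tuple(y[s] for s in scope)
        clique_beliefs[scope][key] = clique_beliefs[scope].get(key, 0.0) + p
    for s in range(3):
        node_beliefs[s][y[s]] = node_beliefs[s].get(y[s], 0.0) + p
degree = {0: 1, 1: 2, 2: 1}  # number of factors touching each variable
```

On loopy graphs the same expression is only an approximation, which is the case the rest of this section addresses.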


BP can be viewed as attempting to solve an optimization problem over possible
choices of marginal distributions, for a particular cost function called the Bethe free
energy. More technically, it has been shown that fixed points of BP are stationary
points of the Bethe free energy [for more details, see 150], when minimized over
locally-consistent marginal distributions. The Bethe energy is an approximation to
another cost function, which Yedidia et al. call the Helmholtz free energy. Since the
minimum Helmholtz energy is exactly $-\log Z(x)$, we approximate $-\log Z(x)$ by
the minimizing value of the Bethe energy, that is:


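\[ \ell_{\text{Bethe}}(\Lambda) = \sum_i \sum_t \sum_c \log \psi_{t,c}(y^{(i)}_{t,c}) + \sum_i \min_b \mathcal{F}_{\text{Bethe}}(b), \tag{3.7} \]

where $\mathcal{F}_{\text{Bethe}}$ is the Bethe free energy, which is defined as

\[ \mathcal{F}_{\text{Bethe}}(b) = \sum_t \sum_c \sum_{y_{t,c}} b_{t,c}(y_{t,c}) \log \frac{b_{t,c}(y_{t,c})}{\psi_{t,c}(y_{t,c})} - \sum_s (d_s - 1) \sum_{y_s} b_s(y_s) \log b_s(y_s). \tag{3.8} \]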
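
The relationship between the Bethe free energy and $-\log Z(x)$ can be checked numerically on a small tree, where the approximation is exact. The sketch below assumes a three-variable chain with arbitrary illustrative potentials and uses brute-force marginals in place of BP beliefs; the helper names `bethe_free_energy` and `exact_marginals` are hypothetical.

```python
import itertools
import math


def bethe_free_energy(psi, clique_beliefs, node_beliefs, degree):
    """Bethe free energy for factor tables psi and beliefs b."""
    energy = 0.0
    for scope, table in psi.items():
        for y_c, b in clique_beliefs[scope].items():
            if b > 0.0:
                energy += b * math.log(b / table[y_c])
    for s, belief in node_beliefs.items():
        energy -= (degree[s] - 1) * sum(b * math.log(b) for b in belief.values() if b > 0.0)
    return energy


def exact_marginals(psi, n_vars, n_states):
    """Brute-force Z and marginals; only feasible for tiny models."""
    states = list(itertools.product(range(n_states), repeat=n_vars))
    score = {y: math.prod(t[tuple(y[s] for s in sc)] for sc, t in psi.items()) for y in states}
    Z = sum(score.values())
    cliques = {sc: {} for sc in psi}
    nodes = {s: {} for s in range(n_vars)}
    for y, v in score.items():
        for sc in psi:
            key = tuple(y[s] for s in sc)
            cliques[sc][key] = cliques[sc].get(key, 0.0) + v / Z
        for s in range(n_vars):
            nodes[s][y[s]] = nodes[s].get(y[s], 0.0) + v / Z
    return Z, cliques, nodes


# Chain 0 - 1 - 2 over binary variables; the numbers are illustrative only.
psi = {
    (0, 1): {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0},
    (1, 2): {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0},
}
Z, cliques, nodes = exact_marginals(psi, n_vars=3, n_states=2)
degree = {0: 1, 1: 2, 2: 1}
F = bethe_free_energy(psi, cliques, nodes, degree)
```

Here `F` agrees with `-math.log(Z)` up to floating-point error; on graphs with cycles the two generally differ.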


So approximate training with BP can be viewed as solving a saddle point problem
of maximizing $\ell_{\text{Bethe}}$ with respect to the model parameters and minimizing with respect
to the beliefs $b_{t,c}(y_{t,c})$. Approximate training using BP is just coordinate ascent: BP
optimizes $\ell_{\text{Bethe}}$ with respect to $b$ for fixed $\Lambda$; and a step along the gradient (3.4)
optimizes $\ell_{\text{Bethe}}$ with respect to $\Lambda$ for fixed $b$. Taking the partial derivative of (3.7)
with respect to a weight $\lambda_k$, we obtain the gradient (3.4) with marginal distributions
replaced by beliefs, as desired.

To  justify  the  approximate  likelihood  (3.6),  we  note  that  the  Bethe  free  energy 
can be written in a dual form, in which the variables are interpreted as log messages
rather  than  beliefs.  Details  of  this  are  presented  by  Minka  [78].  Substituting  the 
Bethe  dual  problem  into  (3.7)  and  simplifying  yields  (3.6). 


3.1.5  Cascaded  Parameter  Estimation 

Joint  maximum  likelihood  training  assumes  that  we  have  access  to  data  in  which 
we  have  observed  all  of  the  variables.  Sometimes  this  is  not  the  case.  One  example 
is transfer learning, which is the general problem of using previous learning problems
that a system has seen to aid its learning of new, related tasks. Usually in transfer
learning,  we  have  one  data  set  labeled  with  the  old  task  variables  and  one  with  the 
new  task  variables,  but  no  data  that  is  jointly  labeled.  In  this  section,  we  describe  a 
cascaded  parameter  estimation  procedure  that  can  be  applied  to  situations  without 
fully-labeled  data. 

For  a  factorial  CRF  with  N  levels,  the  basic  idea  is  to  train  each  level  separately  as 
if  it  were  a  linear-chain  CRF,  using  the  single-best  prediction  of  the  previous  level  as 
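a feature. At the end, each set of individually-trained weights defines a pair of factors,
which are simply multiplied together to form the full FCRF. The cascaded procedure
for an $N$-level FCRF is described formally in Algorithm 3.1. In this description, the
within-level clique template for level $\ell$ has features $f^{\ell}_k(y_{\ell,t}, y_{\ell,t+1}, x, t)$ and weights
$\Lambda^{\ell} = \{\lambda^{\ell}_k\}$; and the between-level clique template has features
$\tilde{f}^{\ell}_k(y_{\ell,t}, y_{\ell-1,t}, x, t)$ and weights $\Gamma^{\ell} = \{\gamma^{\ell}_k\}$.


Algorithm 3.1 Cascaded training for Factorial CRFs
1: Train a linear-chain CRF on $\log p(y_0 | x)$, yielding weights $\Lambda^0$.
2: for all levels $\ell$ do
3:   Compute Viterbi labeling $y^*_{\ell-1} = \arg\max_{y_{\ell-1}} p(y_{\ell-1} | y^*_{\ell-2}, x)$ for each training
     instance $i$.
4:   Train a linear-chain CRF to maximize $\log p(y_\ell | y^*_{\ell-1}, x)$, yielding weights $\Lambda^{\ell}$
     and $\Gamma^{\ell}$.
5: end for
6: return factorial CRF defined as

\[ p(y | x) \propto \prod_{\ell=0}^{N} \prod_{t=1}^{T} \Psi_W(y_{\ell,t}, y_{\ell,t+1}, x, t) \, \Psi_P(y_{\ell,t}, y_{\ell-1,t}, x, t), \tag{3.9} \]

where

\[ \Psi_W(y_{\ell,t}, y_{\ell,t+1}, x, t) = \exp\Big\{ \sum_k \lambda^{\ell}_k f^{\ell}_k(y_{\ell,t}, y_{\ell,t+1}, x, t) \Big\}, \tag{3.10} \]

\[ \Psi_P(y_{\ell,t}, y_{\ell-1,t}, x, t) = \exp\Big\{ \sum_k \gamma^{\ell}_k \tilde{f}^{\ell}_k(y_{\ell,t}, y_{\ell-1,t}, x, t) \Big\}. \tag{3.11} \]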
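
The control flow of the cascaded procedure can be sketched in a few lines of Python. The trainer and decoder are passed in as callables because their details are outside the scope of this sketch; `cascaded_train`, `train_chain_crf`, and `viterbi` are hypothetical names, and re-running the whole cascade built so far on each new input is an assumption about how the lower-level predictions are obtained.

```python
def cascaded_train(datasets, train_chain_crf, viterbi):
    """Train one linear-chain model per level, feeding each the previous level's Viterbi output.

    datasets[l] holds (x, y_l) pairs labeled only for level l.
    train_chain_crf(examples) and viterbi(model, x, prev) are supplied by the caller.
    """
    models = []
    for data in datasets:
        examples = []
        for x, y in data:
            prev = None
            for model in models:              # re-run the cascade built so far on x
                prev = viterbi(model, x, prev)
            examples.append((x, prev, y))     # prev is the single-best lower-level labeling
        models.append(train_chain_crf(examples))
    return models
```

The returned per-level models correspond to the individually trained weight sets that are multiplied together to form the full FCRF.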


For simplicity, we have presented cascaded training for factorial CRFs, but it can
be generalized to other DCRF structures, as long as the DCRF templates can be
partitioned in a way that respects the available labels. In Section 3.1.6.3, we evaluate
cascaded training on a transfer learning problem.

Figure 3.3. Performance of FCRFs and cascaded approaches on noun-phrase chunking,
averaged over five repetitions. The error bars on FCRF and CRF+CRF indicate
the range of the repetitions. (Plot omitted; x-axis: number of training instances.)

3.1.6  Experiments 

We present experiments comparing factorial CRFs to other approaches on noun-phrase
chunking [107]. Also, we compare different schedules of loopy belief propagation
in factorial CRFs.

3.1.6.1  FCRFs  for  Noun-Phrase  Chunking

Automatically  finding  the  base  noun  phrases  in  a  sentence  can  be  viewed  as 
a sequence labeling task by labeling each word as either Begin-Phrase, Inside-Phrase,
or Other [99]. The task is typically performed by an initial pass of part-of-speech
tagging, but then it can be difficult to recover from errors by the tagger.
In  this  section,  we  address  this  problem  by  performing  part-of-speech  tagging  and 
noun-phrase segmentation jointly in a single factorial CRF.

                  Size    CRF+CRF    Brill+CRF    FCRF
POS accuracy       223     86.23       N/A        93.12
                   447     90.44       N/A        95.43
                   670     92.33       N/A        96.34
                   894     93.56       N/A        96.85
                  2234     96.18       N/A        97.87
                  8936     98.28       N/A        98.92
Joint accuracy     223     81.92       N/A        89.19
                   447     86.58       N/A        91.85
                   670     88.68       N/A        92.86
                   894     90.06       N/A        93.60
                  2234     93.00       N/A        94.90
                  8936     95.56       N/A        96.48
NP F1              223     83.84      86.02       86.03
                   447     86.87      88.56       88.59
                   670     88.19      89.65       89.64
                   894     89.21      90.31       90.55
                  2234     91.07      91.90       92.02
                  8936     93.10      93.33       93.87

Table 3.1. Performance comparison of cascaded models and FCRFs on simultaneous
noun-phrase chunking and POS tagging. The column Size lists the number of
sentences used in training. The column CRF+CRF lists results from cascaded CRFs,
and Brill+CRF lists results from a linear-chain CRF given POS tags from the Brill
tagger. The FCRF always outperforms CRF+CRF, and given sufficient training data
outperforms Brill+CRF. With small amounts of training data, Brill+CRF and the
FCRF perform comparably, but the Brill tagger was trained on over 40,000 sentences,
including some in the CoNLL 2000 test set.


Our  data  comes  from  the  CoNLL  2000  shared  task  [107],  and  consists  of  sentences 
from  the  Wall  Street  Journal  annotated  by  the  Penn  Treebank  project  [66].  We 
consider  each  sentence  to  be  a  training  instance,  with  single  words  as  tokens.  The 
data  are  divided  into  a  standard  training  set  of  8936  sentences  and  a  test  set  of  2012 
sentences.  There  are  45  different  POS  labels,  and  the  three  NP  labels. 

We  compare  a  factorial  CRF  to  two  cascaded  approaches,  which  we  call  CRF+CRF 
and  Brill+CRF.  CRF+CRF  uses  one  linear-chain  CRF  to  predict  POS  labels,  and 
another  linear-chain  CRF  to  predict  NP  labels,  using  as  a  feature  the  Viterbi  POS 
labeling from the first CRF. Brill+CRF predicts NP labels using the POS labels
provided from the Brill tagger, which we expect to be more accurate than those from our
CRF, because the Brill tagger was trained on over four times more data, including
sentences from the CoNLL 2000 test set.

  $w_{t-\delta} = w$
  $w_t$ matches [A-Z][a-z]+
  $w_t$ matches [A-Z]
  $w_t$ matches [A-Z]+
  $w_t$ matches [A-Z]+[a-z]+[A-Z]+[a-z]
  $w_t$ matches .*[0-9].*
  $w_t$ appears in list of first names, last names, company names, days,
      months, or geographic entities
  $w_t$ is contained in a lexicon of words with POS $T$ (from Brill tagger)
  $T_t = T$
  $q_k(x, t + \delta)$ for all $k$ and $\delta \in [-3, 3]$

Table 3.2. Input features $q_k(x, t)$ for the CoNLL data. In the above $w_t$ is the word
at position $t$, $T_t$ is the POS tag at position $t$, $w$ ranges over all words in the training
data, and $T$ ranges over all part-of-speech tags.

The factorial CRF uses the graph structure in Figure 3.1(b), with one chain modeling
the part-of-speech process and the other modeling the noun-phrase process. We
use L-BFGS to optimize the posterior $p(\Lambda | D)$, and TRP to compute the marginal
probabilities required by $\partial \ell / \partial \lambda_k$. Based on past experience with linear-chain CRFs,
we use the prior variance $\sigma^2 = 10$ for all models.

We factorize our features as $f_k(y_{t,c}, x, t) = p_k(y_{t,c}) q_k(x, t)$ where $p_k(y_{t,c})$ is a binary
function on the assignment, and $q_k(x, t)$ is a function solely of the input string. Table
3.2 shows the features we use. All three approaches use the same features, with the
obvious exception that the FCRF and the first stage of CRF+CRF do not use the
POS features $T_t = T$.
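
The factorization $f_k = p_k \cdot q_k$ can be illustrated with a small Python sketch that implements a few of the regular-expression features of Table 3.2 together with the window of offsets. It assumes that "matches" means the pattern must match the whole token, that words are 0-indexed, and that the word-identity feature is applied at every offset in the window; the feature names and the helper functions `input_features` and `f_k` are hypothetical, and the lexicon and name-list features are omitted.

```python
import re

# Regular-expression features from Table 3.2 (names are illustrative).
PATTERNS = {
    "init_cap": r"[A-Z][a-z]+",
    "single_cap": r"[A-Z]",
    "all_caps": r"[A-Z]+",
    "mixed_case": r"[A-Z]+[a-z]+[A-Z]+[a-z]",
    "has_digit": r".*[0-9].*",
}


def input_features(words, t, window=3):
    """Return the set of active (feature name, offset) pairs q_k(x, t + delta)."""
    active = set()
    for delta in range(-window, window + 1):
        i = t + delta
        if 0 <= i < len(words):
            active.add((f"word={words[i]}", delta))
            for name, pattern in PATTERNS.items():
                if re.fullmatch(pattern, words[i]):
                    active.add((name, delta))
    return active


def f_k(p_k, q_name, q_delta, y_clique, words, t):
    """f_k(y_{t,c}, x, t) = p_k(y_{t,c}) * q_k(x, t), with both factors binary."""
    return float(p_k(y_clique) and (q_name, q_delta) in input_features(words, t))
```

For example, with `words = ["Mr", "Smith", "paid", "IBM", "42"]` and `t = 1`, the active set contains `("init_cap", 0)` for "Smith" and `("all_caps", 2)` for "IBM".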

Performance  on  noun-phrase  chunking  is  summarized  in  Table  3.1.  As  usual,  we 
measure performance on chunking by precision, the percentage of returned phrases
that are correct; recall, the percentage of correct phrases that were returned; and
their harmonic mean F1. In addition, we also report accuracy on POS labels,2 and
joint accuracy on (POS, NP) pairs. Joint accuracy is simply the percentage of sequence
positions for which all labels were correct.
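
For readers who want the phrase-level metrics spelled out, the following Python sketch computes precision, recall, and F1 from Begin/Inside/Other label sequences. The single-letter label encoding ("B", "I", "O"), the decision to ignore an Inside label that does not follow a phrase start, and the function names are assumptions of this example rather than a description of the official CoNLL scoring script.

```python
def extract_phrases(labels):
    """Return the set of (start, end) spans, end exclusive, encoded by B/I/O labels."""
    phrases, start = set(), None
    for i, label in enumerate(list(labels) + ["O"]):   # sentinel closes a trailing phrase
        if start is not None and label != "I":
            phrases.add((start, i))
            start = None
        if label == "B":
            start = i
    return phrases


def chunk_f1(gold, predicted):
    """Phrase-level precision, recall, and their harmonic mean."""
    g, p = extract_phrases(gold), extract_phrases(predicted)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

A predicted phrase counts as correct only if both of its boundaries agree with a gold phrase, so a single mislabeled token can cost a whole phrase.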

Each  row  in  Table  3.1  is  the  average  of  five  different  random  subsets  of  the  training 
data,  except  for  row  8936,  which  is  run  on  the  single  official  CoNLL  training  set.  All 
conditions  used  the  same  2012  sentences  in  the  official  test  set. 

On  the  full  training  set,  FCRFs  perform  better  on  NP  chunking  than  either  of 
the cascaded approaches, including Brill+CRF. The Brill tagger [12] is an established
part-of-speech  tagger  whose  training  set  is  not  only  over  four  times  bigger  than  the 
CoNLL  2000  data  set,  but  also  includes  the  WSJ  corpus  from  which  the  CoNLL  2000 
test  set  was  derived.  The  Brill  tagger  is  97%  accurate  on  the  CoNLL  data.  Also, 
note  that  the  FCRF — which  predicts  both  noun-phrase  boundaries  and  POS — is  more 
accurate  than  a  linear-chain  CRF  which  predicts  only  part-of-speech.  We  conjecture 
that  the  NP  chain  captures  long-run  dependencies  between  the  POS  labels. 

On smaller training subsets, the FCRF outperforms CRF+CRF and performs
comparably to Brill+CRF. For all the training subset sizes, the difference between
CRF+CRF and the FCRF is statistically significant by a two-sample t-test (p <
0.002). In fact, there was no subset of the data on which CRF+CRF performed
better than the FCRF. The variation over the randomly selected training subsets
is small (the standard deviation over the five repetitions has mean 0.39), indicating
that the observed improvement is not due to chance. Performance and variance on
noun-phrase chunking are shown in Figure 3.3.

2To  simulate  the  effects  of  a  cascaded  architecture,  the  POS  labels  in  the  CoNLL-2000  training 
and  test  sets  were  automatically  generated  by  the  Brill  tagger.  Thus,  POS  accuracy  measures 
agreement with the Brill tagger, not agreement with human judgments.

Method          Time (hr)          NP F1            LBFGS iter
                μ        s         μ        s       μ
Random (3)      15.67    2.90      88.57    0.54    63.6
Tree (3)        13.85    11.6      88.02    0.55    32.6
Tree (∞)        13.57    3.03      88.67    0.57    65.8
Random (∞)      13.25    1.51      88.60    0.53    76.0
Exact           20.49    1.97      88.63    0.53    73.6

Table 3.3. Comparison of F1 performance on the chunking task by inference algorithm.
The columns labeled μ give the mean over five repetitions, and s the sample
standard deviation. Approximate inference methods have labeling accuracy very similar
to exact inference with lower total training time. The differences in training time
between Tree (∞) and Exact and between Random (∞) and Exact are statistically
significant by a paired t-test (df = 4; p < 0.005).


On  this  data  set,  several  systems  are  statistically  tied  for  best  performance.  Kudo 
and Matsumoto [55] report an F1 of 94.39 using a combination of voting support
vector machines. Sha and Pereira [112] give a linear-chain CRF that achieves an
F1 of 94.38, using a second-order Markov assumption, and including bigram and
trigram POS tags as features. An FCRF imposes a first-order Markov assumption
over labels, and represents dependencies only between cotemporal POS and NP labels,
not  POS  bigrams  or  trigrams.  Thus,  Sha  and  Pereira’s  results  suggest  that  more 
richly-structured  DCRFs  could  achieve  better  performance  than  an  FCRF. 

Other  DCRF  structures  can  be  applied  to  many  different  language  tasks,  including 
information  extraction.  Peshkin  and  Pfeffer  [93]  apply  a  generative  DBN  to  extraction 
from  seminar  announcements,  attaining  improved  results,  especially  in  extracting 
locations and speakers, by adding a factor to remember the identity of the last
non-background label.

3.1.6.2  Comparison  of  Inference  Algorithms

Because  DCRFs  can  have  rich  graphical  structure,  and  require  many  marginal 
computations during training, inference is critical to efficient training with many
labels and large data sets. In this section, we compare different inference methods
both  on  training  time  and  labeling  accuracy  of  the  final  model. 

Because  exact  inference  is  feasible  for  a  two-chain  FCRF,  this  provides  a  good 
case to test whether the final classification accuracy suffers when approximate methods
are used to calculate the gradient. Also, we can compare different methods for
approximate  inference  with  respect  to  speed  and  accuracy. 

We  train  factorial  CRFs  on  the  noun-phrase  chunking  task  described  in  the  last 
section. We compute the gradient using exact inference and approximate belief propagation
using both random and tree-based schedules, as described in Section 3.1.2.
Algorithms are considered to have converged when no message changes by more than
$10^{-3}$. In these experiments, the approximate BP algorithms always converged, although
this is not guaranteed in general. We trained on five random subsets of 5%
of  the  training  data,  and  the  same  five  subsets  were  used  in  each  condition.  All 
experiments  were  performed  on  a  2.8  GHz  Intel  Xeon  with  4  GB  of  memory. 

For each message-passing schedule, we compare two termination conditions: terminating
on convergence (Random(∞) and Tree(∞) in Table 3.3) and terminating
after three iterations (Random(3) and Tree(3)). Although the early-terminating
BP runs are less accurate, they are faster, which we hypothesized could result in
lower overall training time. If the gradient is too inaccurate, however, then the optimization
will require many more iterations, resulting in greater training time overall,
even though the time per gradient computation is lower. Another hazard is that no
maximizing step may be possible along the approximate gradient, even if one is possible
along the true gradient. In this case, the gradient descent algorithm terminates
prematurely, leading to decreased performance.

Table 3.3 shows the average F1 score and total training times of DCRFs trained
by the different inference methods. Unexpectedly, letting the belief propagation
algorithms run to convergence led to lower training time than the early cutoff. For
example, even though Random(3) averaged 427 sec per gradient computation compared
to 571 sec for Random(∞), Random(∞) took less total time to train, because
Random(∞) needed an average of 83.6 gradient computations per training run, compared
to 133.2 for Random(3).
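
The trade-off can be checked with back-of-the-envelope arithmetic using the figures quoted above. The Python sketch below multiplies the average time per gradient computation by the average number of gradient computations; since a product of averages is not the same as an average of products, the results only approximate the totals reported in Table 3.3 (15.67 and 13.25 hours). The function name is hypothetical.

```python
SECONDS_PER_HOUR = 3600.0


def estimated_hours(sec_per_gradient, gradients_per_run):
    """Rough total training time: cost per gradient times number of gradients."""
    return sec_per_gradient * gradients_per_run / SECONDS_PER_HOUR


early = estimated_hours(427, 133.2)      # Random(3): cheaper gradients, more of them
converged = estimated_hours(571, 83.6)   # Random(inf): costlier gradients, fewer of them
```

The estimates come to roughly 15.8 and 13.3 hours respectively, so the fewer but costlier gradient computations of the converged schedule still add up to less total time.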

As  for  final  classification  performance,  the  various  approximate  methods  and  exact 
inference  perform  similarly,  except  that  Tree(3)  has  lower  final  performance  because 
maximization ended prematurely, averaging only 32.6 maximizer iterations. The variance
in F1 over the subsets, although not large, is much larger than the F1 difference
between the inference algorithms.

In  all  cases,  the  messages  were  initialized  to  uniform  messages.  One  might  think 
to  take  advantage  of  the  fact  that  BP  is  embedded  in  a  gradient-based  optimizer,  by 
initializing  the  BP  iterations  at  the  final  messages  from  the  previous  gradient  step. 
In  preliminary  experiments,  this  did  not  appreciably  help  early  stopping. 

Previous  work  [137]  has  shown  that  TRP  converges  faster  than  synchronous  belief 
propagation, that is, with Jacobi updates. Both the schedules discussed in Section
3.1.2 use asynchronous Gauss-Seidel updates. We emphasize that the graphical models
in these experiments are always pairs of coupled chains. On more complicated
models, or with a different choice of spanning trees, tree-based updates could outperform
random asynchronous updates. Also, in complex models, the difference in
classification accuracy between exact and approximate inference could be larger, although
in such cases exact inference is likely to be intractable.

In  summary,  we  draw  three  conclusions  about  belief  propagation  on  this  particular 
model.  First,  using  approximate  inference  instead  of  exact  inference  leads  to  lower 
overall  training  time  with  no  loss  in  accuracy.  Indeed,  the  two-level  FCRFs  that 
we  consider  here  appear  to  have  been  particularly  easy  cases  for  BP,  because  we 
observed  little  difficulty  with  convergence.  Second,  there  is  little  difference  between 
a  random  tree  schedule  and  a  completely  random  schedule  for  belief  propagation. 
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wt  =  to 

Wt  matches  [A-Z]  [a-z]  + 
wt  matches  [A-Z]  [A-Z]  + 

Wt  matches  [A-Z] 
wt  matches  [A-Z]  + 

Wt  matches  [A-Z]  +  [a-z]  +  [A-Z]  +  [a-z] 

wt  appears  in  list  of  first  names, 
last  names,  honorifics,  etc. 
wt  appears  to  be  part  of  a  time  followed  by  a  dash 
wt  appears  to  be  part  of  a  time  preceded  by  a  dash 
wt  appears  to  be  part  of  a  date 
gfc(x,  t  +  5)  for  all  k  and  5  €  [—4, 4] 


Table  3.4.  Input  features  g*.(x,  £)  for  the  seminars  data.  In  the  above  wt  is  the  word 
at  position  t,  Tt  is  the  POS  tag  at  position  t,  w  ranges  over  all  words  in  the  training 
data,  and  T  ranges  over  all  Penn  Treebank  part-of-speech  tags.  The  “appears  to  be” 
features  are  based  on  hand-designed  regular  expressions  that  can  span  several  tokens. 


Third,  running  belief  propagation  to  convergence  leads  both  to  increased  classification 
accuracy  and  lower  overall  training  time  than  an  early  cutoff. 
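
The capitalization patterns and the windowed features $q_k(\mathbf{x}, t + \delta)$ of Table 3.4 can
be illustrated with a short sketch. This is only a sketch under stated assumptions: the
function names and feature-name strings are hypothetical, each pattern is assumed to match
the whole token (the table does not say how the patterns are anchored), and the list-based
and "appears to be" features are omitted.

```python
import re

# Capitalization patterns copied from Table 3.4. Requiring a match on the
# whole token (fullmatch) is an assumption; the table does not specify it.
CAP_PATTERNS = {
    "Aa+": re.compile(r"[A-Z][a-z]+"),
    "AA+": re.compile(r"[A-Z][A-Z]+"),
    "A": re.compile(r"[A-Z]"),
    "A+": re.compile(r"[A-Z]+"),
    "A+a+A+a": re.compile(r"[A-Z]+[a-z]+[A-Z]+[a-z]"),
}


def token_features(words, t):
    """Names of the features that are active for the word at position t."""
    word = words[t]
    active = {"word=" + word}
    for name, pattern in CAP_PATTERNS.items():
        if pattern.fullmatch(word):
            active.add("caps=" + name)
    return active


def windowed_features(words, t, width=4):
    """Copy the features of every position t + delta, delta in [-width, width]."""
    active = set()
    for delta in range(-width, width + 1):
        if 0 <= t + delta < len(words):
            for name in token_features(words, t + delta):
                active.add(f"{name}@{delta}")
    return active
```

Tagging each copied feature with its offset keeps, for example, "the next word is
capitalized" distinct from "the current word is capitalized".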

3.1.6.3 Cascaded Training for Transfer Learning

In  this  section,  we  consider  an  application  of  DCRFs  to  transfer  learning,  both  as 
an  additional  application  of  DCRFs,  and  as  an  evaluation  of  the  cascaded  training 
procedure  described  in  Section  3.1.5.  The  task  is  to  extract  the  details  of  an  academic 
seminar — including  its  starting  time,  ending  time,  location,  and  speaker — from  an 
email  announcement.  The  data  is  a  collection  of  485  e-mail  messages  announcing 
seminars  at  Carnegie  Mellon  University,  gathered  by  Freitag  [35],  and  has  been  the 
subject  of  much  previous  work  using  a  wide  variety  of  learning  methods.  Despite  all 
this  work,  however,  the  best  reported  systems  have  precision  and  recall  on  speaker 
names  and  locations  of  only  about  75% — too  low  to  use  in  a  practical  system.  This 
task  is  so  challenging  because  the  messages  are  written  by  many  different  people,  who 
each  have  different  ways  of  presenting  the  announcement  information. 
System             Reference                    stime  etime  location  speaker  overall
WHISK              Soderland [115]              92.6   86.1   66.6      18.3     65.9
SRV                Freitag [35]                 98.5   77.9   72.7      56.3     76.4
HMM                Freitag and McCallum [36]    98.5   62.1   78.6      76.6     78.9
RAPIER             Califf and Mooney [16]       95.9   94.6   73.4      53.1     79.3
SNOW-IE            Roth and Wen-tau Yih [105]   99.6   96.3   75.2      73.8     86.2
(LP)^2             Ciravegna [20]               99.0   95.5   75.0      77.6     86.8
CRF (no transfer)  This chapter                 99.1   97.3   81.0      73.7     87.8
FCRF (cascaded)    This chapter                 99.2   96.0   84.3      74.2     88.4
FCRF (joint)       This chapter                 99.1   96.0   85.3      76.3     89.2

Table  3.5.  Comparison  of  F1  performance  on  the  seminars  data.  Joint  decoding 
performs  significantly  better  than  cascaded  decoding.  The  overall  column  is  the 
mean  of  the  other  four.  (This  table  was  adapted  from  Peshkin  and  Pfeffer  [93].) 


[Figure 3.4 plot not reproduced; horizontal axis: number of training instances.]


Figure 3.4. Learning curves for the seminars data set on the speaker field, averaged
over 10-fold cross validation. Joint decoding performs equivalently to cascaded
decoding with 25% more data.
Because the task includes finding locations and person names, the output of a
named-entity tagger is a useful feature. It is not a perfectly indicative feature,
however, because many other kinds of person names appear in seminar announcements,
such as the names of faculty hosts, departmental secretaries, and sponsors of lecture
series. For example, the token Host: strongly indicates both that what follows is a
person name and that the person is not the seminar's speaker.

Even so, named-entity predictions do improve performance on this task. Therefore,
we wish to transfer learning from the named-entity task to the seminar announcement
task. To do this, we define an FCRF that predicts both named-entity labels and
seminar labels, training it using cascaded training (Section 3.1.5). Although on the
noun-phrase chunking data, cascaded training performs worse than cascaded training
and decoding (Section 3.1.6.1), here we do not have a single data set that is labeled
for both tasks. Performing joint inference over both chains is therefore impossible
during training; at test time, however, we can still perform joint inference over both
chains. We call this procedure joint decoding, as opposed to the cascaded procedure
of using the single-best named-entity label in the seminar predictor. Joint decoding
might be expected to perform better because of helpful feedback between the tasks:
information from the seminar-field predictions can improve named-entity predictions,
which in turn improve the seminar-field predictions. Therefore, we present two
comparisons: (a) between the FCRF trained to incorporate transfer and a comparable
linear-chain CRF, and (b) at test time, between cascaded decoding and joint decoding.

We use the predictions from a CRF named-entity tagger that we train on the standard
CoNLL 2003 English data set. The CoNLL 2003 data set consists of newswire
articles from Reuters in which entities are labeled as people, locations, organizations,
or miscellaneous entities. It is much larger than the seminar announcements data set:
while the named-entity data contains 203,621 tokens for training, the seminar
announcements data set contains only slightly over 60,000 training tokens.
Previous work on the seminars data has used a one-field-per-document evaluation.
That is, for each field, the CRF selects a single field value from its Viterbi path,
and this extraction is counted as correct if it exactly matches any of the true field
mentions in the document. We compute precision and recall following this convention,
and report their harmonic mean F1. As in the previous work, we use 10-fold cross
validation with a 50/50 training/test split. We use a spherical Gaussian prior on
parameters with variance $\sigma^2 = 0.5$.
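
The one-field-per-document convention can be sketched as follows. This is a minimal
illustration rather than the evaluation script used for the experiments: the function
name and data layout are hypothetical, and the choice of denominators (documents with an
extraction for precision, documents containing the field for recall) is an assumption.

```python
def one_field_per_document_f1(predictions, gold_mentions):
    """Score one field under the one-field-per-document convention.

    predictions:   one extracted string per document, or None if the
                   system extracted nothing for this field.
    gold_mentions: one set of true mention strings per document.
    """
    extracted = correct = relevant = 0
    for pred, gold in zip(predictions, gold_mentions):
        if gold:
            relevant += 1          # the document contains the field
        if pred is not None:
            extracted += 1         # the system proposed a value
            if pred in gold:       # exact match with any true mention
                correct += 1
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```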

We  evaluate  whether  joint  decoding  with  cascaded  training  performs  better  than 
cascaded  training  and  decoding.  Table  3.5  compares  cascaded  and  joint  decoding  for 
CRFs  with  other  previous  results  from  the  literature.3  The  features  we  use  are  listed 
in  Table  3.4.  Although  previous  work  has  used  very  different  feature  sets,  we  include 
a  no-transfer  CRF  baseline  to  assess  the  impact  of  transfer  from  the  CoNLL  data  set. 
All  the  CRF  runs  used  exactly  the  same  features. 

On the most challenging fields, location and speaker, cascaded transfer is more
accurate than no transfer at all, and joint decoding is more accurate than cascaded
decoding. In particular, for speaker, we see an error reduction of 8% by using joint
decoding over cascaded. The difference in F1 between cascaded and joint decoding
is statistically significant for speaker (paired t-test; p = 0.017) but only marginally
significant for location (p = 0.067). Our results are competitive with previous work;
for example, on location, the CRF is more accurate than any of the existing systems,
and the CRF has higher overall performance, that is, averaged over all fields,
than the previously reported systems.
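
As a check on the 8% figure, and assuming that error is measured as 100 minus the speaker
F1 reported in Table 3.5 (74.2 for cascaded decoding, 76.3 for joint decoding), the
relative error reduction works out as

```latex
\[
  \frac{(100 - 74.2) - (100 - 76.3)}{100 - 74.2}
  = \frac{25.8 - 23.7}{25.8}
  \approx 0.081 .
\]
```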

Figure 3.4 shows the difference in performance between joint and cascaded decoding
as a function of training set size. Cascaded decoding with the full training set of
242 emails performs equivalently to joint decoding on only 181 training instances, a
25% reduction in the training set.

3 We omit one relevant paper [93] because its evaluation method differs from all the
other previous work.

Examining  the  trained  models,  we  can  observe  errors  made  by  the  general-purpose 
named  entity  tagger,  and  how  they  can  be  corrected  by  considering  the  seminars 
labels.  In  newswire  text,  long  runs  of  capitalized  words  are  rare,  often  indicating  the 
name  of  an  entity.  In  email  announcements,  runs  of  capitalized  words  are  common  in 
formatted  text  blocks  like: 

Location:  Baker  Hall 

Host: Michael Erdmann

In  this  type  of  situation,  the  general  named  entity  tagger  often  mistakes  Host:  for 
the  name  of  an  entity,  especially  because  the  word  preceding  Host  is  also  capitalized. 
On  one  of  the  cross-validated  testing  sets,  of  80  occurrences  of  the  word  Host:,  the 
named-entity  tagger  labels  52  as  some  kind  of  entity.  When  joint  decoding  is  used, 
however,  only  20  occurrences  are  labeled  as  entities.  Recall  that  in  both  of  these 
settings,  training  is  performed  in  exactly  the  same  way;  the  only  difference  is  that 
joint  decoding  takes  into  account  information  about  the  seminar  labels  when  choosing 
named-entity  labels.  This  is  an  example  of  how  domain-specific  information  from  the 
main task can improve performance on a more standard, general-purpose subtask.

3.1.7  Related  Work 

Since the original work on conditional random fields [58], there has been much
interest in training discriminative models with more general graphical structures. One
of the first such applications was relational Markov networks [128], which were first
applied to collective classification of Web pages. There has also been interest in
grid-structured loopy CRFs for computer vision [46, 57], in which jointly-trained Markov
random fields are a classical technique. Another type of structured problem which has
seen some attention in the literature is discriminative learning of distributions over
context-free parse trees, in which training has been done using max-margin methods
[74, 130] and perceptron-like methods [135].

Currently, the most popular alternative approaches to training structured
discriminative models are maximum-margin training [3, 129] and perceptron training [22],
which has been especially popular in NLP because of its ease of implementation.

The factorial CRF that we present here should not be confused with the factorial
Markov random fields that have been proposed in the computer vision community
[52]. In that model, each of the factors is a grid, rather than a chain, and they interact
through a directed model, as in a factorial HMM.

Finally,  some  results  presented  here  have  appeared  in  earlier  conference  versions, 
in  particular  the  results  on  noun-phrase  chunking  [125]  and  transfer  learning  [120]. 

3.2  Skip-chain  CRFs 

Another type of long-range dependence that arises in information extraction
occurs on repeated mentions of the same field. When the same entity is mentioned
more than once in a document, such as Robert Booth, in many cases all mentions
have the same label, such as Seminar-Speaker. We can take advantage of
this fact by favoring labelings that treat repeated words identically, and by combining
features from all occurrences so that the extraction decision can be made based on
global information. Furthermore, identifying all mentions of an entity can be useful
in itself, because each mention might contain different useful information. However,
most extraction systems, whether probabilistic or not, do not take advantage of this
dependency, instead treating the separate mentions independently.

To perform collective labeling, we need to represent dependencies between distant
terms in the input. But this reveals a general limitation of sequence models, whether
generatively or discriminatively trained. Sequence models make a Markov assumption
among labels, that is, that any label $y_t$ is independent of all previous labels given
its immediate predecessors $y_{t-k}, \ldots, y_{t-1}$. This represents dependence only between
nearby nodes (for example, between bigrams and trigrams) and cannot represent
the higher-order dependencies that arise when identical words occur throughout a
document.

Figure 3.5. Graphical representation of a skip-chain CRF. Identical words are connected
because they are likely to have the same label.

To relax this assumption, we introduce the skip-chain CRF, a conditional model
that  collectively  segments  a  document  into  mentions  and  classifies  the  mentions  by 
entity  type,  while  taking  into  account  probabilistic  dependencies  between  distant 
mentions.  These  dependencies  are  represented  in  a  skip-chain  model  by  augmenting 
a  linear-chain  CRF  with  factors  that  depend  on  the  labels  of  distant  but  similar  words. 
This  is  shown  graphically  in  Figure  3.5. 

Even though the limitations of n-gram models have been widely recognized within
the natural language processing community, long-distance dependencies are difficult
to represent in generative models, because full n-gram models have too many parameters
if n is large. We avoid this problem by selecting which skip edges to include
based on the input string. This kind of input-specific dependence is difficult to represent
in a generative model, because it makes generating the input more complicated.
In other words, conditional models have been popular because of their flexibility in
allowing overlapping features; skip-chain CRFs take advantage of their flexibility in
allowing input-specific model structure.
3.2.1 Model


The  skip-chain  CRF  is  essentially  a  linear-chain  CRF  with  additional  long-distance 
edges  between  similar  words.  We  call  these  additional  edges  skip  edges.  The  features 
on  skip  edges  can  incorporate  information  from  the  context  of  both  endpoints,  so  that 
strong  evidence  at  one  endpoint  can  influence  the  label  at  the  other  endpoint. 

When  applying  the  skip-chain  model,  we  must  choose  which  skip  edges  to  include. 
The  simplest  choice  is  to  connect  all  pairs  of  identical  words,  but  more  generally  we 
can  connect  any  pair  of  words  that  we  believe  to  be  similar,  for  example,  pairs  of 
words  that  belong  to  the  same  stem  class,  or  have  small  edit  distance.  In  addition, 
we  must  be  careful  not  to  include  too  many  skip  edges,  because  this  could  result 
in  a  graph  that  makes  approximate  inference  difficult.  So  we  need  to  use  similarity 
metrics  that  result  in  a  sufficiently  sparse  graph.  In  the  experiments  below,  we  focus 
on  named-entity  recognition,  so  we  connect  pairs  of  identical  capitalized  words. 
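
The choice of skip edges used in the experiments (pairs of identical capitalized words)
can be sketched as below. The function name is hypothetical, and treating a word as
capitalized when its first character is an uppercase letter is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def skip_edges(words):
    """Return sorted (u, v) pairs, u < v, linking identical capitalized words."""
    positions = defaultdict(list)
    for t, word in enumerate(words):
        if word[:1].isupper():      # assumed definition of "capitalized"
            positions[word].append(t)
    edges = []
    for indices in positions.values():
        # Connect every pair of occurrences of the same word.
        edges.extend(combinations(indices, 2))
    return sorted(edges)
```

A word that occurs m times contributes m(m - 1)/2 edges, which is one reason a
restrictive similarity criterion helps keep the graph sparse.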

Formally, the skip-chain CRF is defined as a general CRF with two clique templates:
one for the linear-chain portion, and one for the skip edges. For a sentence
$\mathbf{x}$, let $\mathcal{I} = \{(u, v)\}$ be the set of all pairs of sequence positions for which there are
skip edges. For example, in the experiments reported here, $\mathcal{I}$ is the set of indices of
all pairs of identical capitalized words. Then the probability of a label sequence $\mathbf{y}$
given an input $\mathbf{x}$ is modeled as

\[
  p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x})
    \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, \mathbf{x}),
  \tag{3.12}
\]

where $\Psi_t$ are the factors for linear-chain edges, and $\Psi_{uv}$ are the factors over skip
edges. These factors are defined as

\[
  \Psi_t(y_t, y_{t-1}, \mathbf{x}) = \exp\Big\{ \sum_{k} \lambda_{1k} f_{1k}(y_t, y_{t-1}, \mathbf{x}, t) \Big\}
  \tag{3.13}
\]

\[
  \Psi_{uv}(y_u, y_v, \mathbf{x}) = \exp\Big\{ \sum_{k} \lambda_{2k} f_{2k}(y_u, y_v, \mathbf{x}, u, v) \Big\}
  \tag{3.14}
\]

where $\theta_1 = \{\lambda_{1k}\}_{k=1}^{K_1}$ are the parameters of the linear-chain template, and
$\theta_2 = \{\lambda_{2k}\}_{k=1}^{K_2}$ are the parameters of the skip template. The full set of model
parameters is $\theta = \{\theta_1, \theta_2\}$.
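
To make the normalization in (3.12) concrete, the following sketch computes
$p(\mathbf{y} \mid \mathbf{x})$ by brute-force enumeration of all label sequences. It is only
feasible for tiny inputs and is not the inference procedure used in this chapter. The
function names are hypothetical, the log-factors are passed in as plain Python callables,
and using None as the predecessor of the first label is an assumption.

```python
import itertools
import math


def log_score(y, x, edges, lin, skip):
    """Sum of log-factors: log Psi_t over positions plus log Psi_uv over skip edges."""
    total = 0.0
    for t in range(len(y)):
        prev = y[t - 1] if t > 0 else None   # assumed start convention
        total += lin(y[t], prev, x, t)
    for u, v in edges:
        total += skip(y[u], y[v], x, u, v)
    return total


def conditional_prob(y, x, edges, labels, lin, skip):
    """p(y | x) with Z(x) obtained by enumerating every labeling of x."""
    scores = [log_score(c, x, edges, lin, skip)
              for c in itertools.product(labels, repeat=len(x))]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
    return math.exp(log_score(tuple(y), x, edges, lin, skip) - log_z)
```

With all linear-chain log-factors set to zero and a skip log-factor that rewards equal
labels, labelings that agree across the skip edge receive more probability mass than
those that do not.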

As described in Section 2.4.6, both the linear-chain features and skip-chain features
are factorized into indicator functions of the outputs and observation functions, as in
(2.83). In general the observation functions $q_k(\mathbf{x}, t)$ can depend on arbitrary positions
of the input string. For example, a useful feature for NER is $q_k(\mathbf{x}, t) = 1$ if and only
if $x_{t+1}$ is a capitalized word.

The observation functions for the skip edges are chosen to combine the observations
from each endpoint. Formally, we define the feature functions for the skip edges
to factorize as:

\[
  f'_k(y_u, y_v, \mathbf{x}, u, v) = \mathbf{1}_{\{y_u = \tilde{y}_u\}} \mathbf{1}_{\{y_v = \tilde{y}_v\}} \, q'_k(\mathbf{x}, u, v)
  \tag{3.15}
\]

This choice allows the observation functions $q'_k(\mathbf{x}, u, v)$ to combine information from
the neighborhood of $y_u$ and $y_v$. For example, one useful feature is $q'_k(\mathbf{x}, u, v) = 1$ if
and only if $x_u = x_v =$ "Booth" and $x_{v-1} =$ "Speaker:". This can be a useful feature when
the context around $x_u$, such as "Robert Booth is manager of control engineering...,"
does not make clear whether or not Robert Booth is presenting a talk, but the context
around $x_v$ is clear, such as "Speaker: Robert Booth." 4


4 This example is taken from an actual error made by a linear-chain CRF on the seminars data
set. We present results from this data set in Section 3.2.3.
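
The factorization in (3.15) and the "Booth" example can be sketched as follows. The
helper names are hypothetical, and the tokenization in the usage note is invented for
illustration.

```python
def q_booth_after_speaker(x, u, v):
    """Observation function from the text: 1 iff x_u = x_v = "Booth"
    and the word before position v is "Speaker:"."""
    return int(x[u] == x[v] == "Booth" and v >= 1 and x[v - 1] == "Speaker:")


def make_skip_feature(label_u, label_v, q):
    """Build a skip-edge feature as in (3.15): an indicator on the label
    pair (label_u, label_v) multiplied by the observation function q."""
    def feature(y_u, y_v, x, u, v):
        return int(y_u == label_u and y_v == label_v) * q(x, u, v)
    return feature
```

For the hypothetical token sequence ["Booth", "is", "manager", ".", "Speaker:", "Booth"],
the feature built with the label pair (SPEAKER, SPEAKER) fires on the skip edge (0, 5)
only when both endpoints are labeled SPEAKER, so a positive weight on it lets the clear
context at the second mention support the label at the first.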


System             stime  etime  location  speaker  overall
BIEN [93]          96.0   98.8   87.1      76.9     89.7
Linear-chain CRF   97.5   97.5   88.3      77.3     90.2
Skip-chain CRF     96.7   97.2   88.1      80.4     90.6

Table 3.6. Comparison of F1 performance on the seminars data. The top line gives
a dynamic Bayes net that has been previously used on this data set. The skip-chain
CRF beats the previous systems in overall F1 and on the speaker field, which has
proved to be the hardest field of the four. Overall F1 is the average of the F1 scores
for the four fields.


Field      Linear-chain  Skip-chain
stime      12.6          17
etime      3.2           5.2
location   6.4           0.6
speaker    30.2          4.8

Table 3.7. Number of inconsistently mislabeled tokens, that is, tokens that are
mislabeled even though the same token is labeled correctly elsewhere in the document.
Learning long-distance dependencies reduces this kind of error in the speaker and
location fields. Numbers are averaged over 5 folds.


3.2.2  Parameter  Estimation 

Because the loops in a skip-chain CRF can be long and overlapping, exact inference,
and hence maximum likelihood, is intractable for the models considered here.
The  running  time  required  by  exact  inference  is  exponential  in  the  size  of  the  largest 
clique  in  the  graph’s  junction  tree.  In  junction  trees  created  from  the  seminars  data, 
29  of  the  485  instances  have  a  maximum  clique  size  of  10  or  greater,  and  11  have 
a  maximum  clique  size  of  14  or  greater.  (The  worst  instance  has  a  clique  with  61 
nodes.)  These  cliques  are  far  too  large  to  perform  inference  exactly.  For  reference, 
representing  a  single  factor  that  depends  on  14  variables  requires  more  memory  than 
can  be  addressed  in  a  32-bit  architecture. 
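
The memory claim can be checked with a short calculation. The number of labels is not
stated in this passage; the sketch assumes five (the four fields plus a background label)
and counts table entries, so it gives a lower bound even at one byte per entry.

```python
def factor_table_entries(num_labels, num_variables):
    """Number of entries in a dense table over num_variables discrete
    variables that each take num_labels values."""
    return num_labels ** num_variables


# Assumed label count: four seminar fields plus one background label.
ENTRIES_14_VARIABLES = factor_table_entries(5, 14)

# Number of distinct byte addresses available with 32-bit pointers.
ADDRESSABLE_BYTES_32_BIT = 2 ** 32
```

Under this assumption the table has 6,103,515,625 entries, which already exceeds the
4,294,967,296 addressable bytes; a clique of 61 nodes is far beyond reach.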

Instead, we perform approximate inference using loopy belief propagation, which
was described in Section 2.4.4. As with the FCRFs in Section 3.1, inference uses the
TRP schedule with random spanning trees. Parameters are selected to maximize the
Bethe likelihood (3.6) using limited-memory BFGS. As discussed in Section 3.1.4, the
gradient of this objective is identical to that of the true likelihood, except that where
the true gradient uses the true marginal distributions of the model, the gradient of
the Bethe likelihood uses the beliefs resulting from BP.

It is important to carefully choose the initial parameter setting for the optimization
procedure. At first, this may seem surprising, because the likelihood is a concave
function of the parameters, and so gradient-based optimization should find the global
optimum from any starting point. In fact, initialization does matter, for two reasons.
First, even though the true likelihood is concave, the BP approximation to the likelihood
is not, so it is possible to find a local maximum that is not globally optimal. If
this happens, it means that the final parameter setting has a zero BP gradient with
respect to one fixed point, but not with respect to other fixed points.

A second explanation, and perhaps a more relevant one, is that many parameter settings
have likelihoods that are numerically very close to optimal, but perform differently on
unseen data. In the models considered in this thesis, the features are often rich enough
that it is possible to fit the training data arbitrarily closely. Furthermore, because
the features are linearly dependent, there are many different lines in parameter space
that approach the empirical expectations from different directions. The numerical
optimization algorithm terminates when the difference in value or gradient becomes
too small, so approaching the maximum from different directions can still lead to
different parameter settings. As a concrete example, consider the two-level DCRF
of the previous section. Using only the within-chain features, it is possible to find
solutions with likelihood only slightly worse than the full model, which uses both
within-chain and between-chain factors. But we saw in the experimental results
that using the between-chain factors improves accuracy on unseen data. So this
is a case where a poor initialization of the optimization procedure, namely, one that
encouraged the within-chain factors to be given too much weight, can significantly
degrade accuracy.

In  the  experiments  reported  in  the  next  section,  the  skip-chain  CRF  is  initialized 
from  a  linear-chain  CRF.  That  is,  the  linear-chain  factors  of  the  model  are  initialized 
from  the  weights  of  a  fully-trained  linear-chain  CRF,  and  the  long-distance  factors 
are  initialized  to  uniform.  If  we  instead  start  at  the  uniform  distribution — that  is,  by 
initializing  all  parameters  to  0 — not  only  does  loopy  BP  training  take  much  longer, 
but  testing  performance  is  much  worse,  because  the  convex  optimization  procedure 
has  difficulty  with  noisier  gradients.  With  uniform  initialization,  loopy  BP  does  not 
converge  for  all  training  instances,  especially  at  early  iterations  of  training.  That  is, 
carefully  initializing  the  parameters  avoids  regions  of  parameter  space  in  which  BP 
performs  poorly. 
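
The initialization described above can be sketched as follows. The function name and the
flat-list representation of the weights are hypothetical; the point is only that zero
skip weights make every skip factor in (3.14) equal to exp(0) = 1, so the initial model
coincides with the trained linear-chain CRF.

```python
def init_skip_chain_weights(linear_chain_weights, num_skip_features):
    """Warm start: theta_1 is copied from a trained linear-chain CRF and
    theta_2 is all zeros, which makes every skip factor uniform."""
    theta_1 = list(linear_chain_weights)
    theta_2 = [0.0] * num_skip_features
    return theta_1, theta_2
```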

3.2.3  Results 

We evaluate skip-chain CRFs on the seminar announcements data set discussed
in Section 3.1.6.3. The messages are annotated with the seminar's starting time,
ending  time,  location,  and  speaker.  Often  the  fields  are  listed  multiple  times  in  the 
message.  For  example,  the  speaker  name  might  be  included  both  near  the  beginning 
and  later  on,  in  a  sentence  like  “If  you  would  like  to  meet  with  Professor  Smith. . .  ” 
As  mentioned  earlier,  it  can  be  useful  to  find  both  such  mentions,  because  different 
information  can  occur  in  the  surrounding  context  of  each  mention:  for  example,  the 
first  mention  might  be  near  an  institutional  affiliation,  while  the  second  mentions  that 
Smith  is  a  professor. 

We  evaluate  a  skip-chain  CRF  with  skip  edges  between  identical  capitalized  words. 
The  motivation  for  this  is  that  the  hardest  aspect  of  this  data  set  is  identifying 
speakers  and  locations,  and  capitalized  words  that  occur  multiple  times  in  a  seminar 
announcement  are  likely  to  be  either  speakers  or  locations. 


Table 3.4 shows the list of input features we used. For a skip edge $(u, v)$, the input
features we used were the disjunction of the input features at $u$ and $v$, that is,

\[
  q'_k(\mathbf{x}, u, v) = q_k(\mathbf{x}, u) \oplus q_k(\mathbf{x}, v)
  \tag{3.16}
\]

where $\oplus$ is binary or. All of our results are averaged over 5-fold cross-validation with
an 80/20 split of the data. We report results from both a linear-chain CRF and a
skip-chain CRF with the same set of input features.
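
The disjunction in (3.16) amounts to an elementwise OR of the two endpoints' binary
feature vectors. A minimal sketch, with a hypothetical function name and 0/1 lists
standing in for the feature vectors:

```python
def skip_edge_observations(q_u, q_v):
    """Elementwise binary OR of the 0/1 feature vectors at the two endpoints."""
    if len(q_u) != len(q_v):
        raise ValueError("endpoint feature vectors must have the same length")
    return [a | b for a, b in zip(q_u, q_v)]
```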

We calculate precision and recall as (see footnote 5)

\[
  P = \frac{\#\ \text{tokens extracted correctly}}{\#\ \text{tokens extracted}},
  \qquad
  R = \frac{\#\ \text{tokens extracted correctly}}{\#\ \text{true tokens of field}}
\]

As usual, we report F1 = (2PR)/(P + R).
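
These per-token metrics can be sketched as follows for a single field. The function name
and the representation of labelings as parallel lists of label strings are assumptions
made for illustration.

```python
def per_token_prf(predicted, gold, field):
    """Per-token precision, recall, and F1 for one field label."""
    extracted = sum(1 for p in predicted if p == field)
    true_tokens = sum(1 for g in gold if g == field)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g == field)
    precision = correct / extracted if extracted else 0.0
    recall = correct / true_tokens if true_tokens else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```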

Table 3.6 compares a skip-chain CRF to a linear-chain CRF and to a dynamic
Bayes net used in previous work [93]. The skip-chain CRF performs much better
than all the other systems on the SPEAKER field, which is the field for which the skip
edges would be expected to make the most difference. On the other fields, however,
the skip-chain CRF does slightly worse (less than 1% absolute F1).

We expected that the skip-chain CRF would do especially well on the speaker
field, because speaker names tend to appear multiple times in a document, and a
skip-chain CRF can learn to label the multiple occurrences consistently. To test this
hypothesis, we measure the number of inconsistently mislabeled tokens, that is, tokens
that are mislabeled even though the same token is classified correctly elsewhere in
the document. Table 3.7 compares the number of inconsistently mislabeled tokens in
the test set between linear-chain and skip-chain CRFs. For the linear-chain CRF, on
average 30.2 true speaker tokens are inconsistently mislabeled. Because the
linear-chain CRF mislabels 121.6 true speaker tokens, this situation accounts for 24.7%
of the missed speaker tokens.

5 Previous work on this data set has traditionally reported precision and recall only at the
document level, that is, from each document the system extracts only one field of each type.
Because the goal of the skip-chain CRF is to extract all mentions in a document, these metrics
are inappropriate, so we cannot compare with this previous work. Peshkin and Pfeffer [93] do
use the per-token metric (personal communication), so our comparison is fair in that respect.

The skip-chain CRF shows a dramatic decrease in inconsistently mislabeled tokens
on the speaker field, from 30.2 tokens to 4.8. Consequently, the skip-chain CRF also
has much better recall on speaker tokens than the linear-chain CRF (70.0 R linear
chain, 76.8 R skip chain). This explains the increase in F1 from linear-chain to
skip-chain CRFs, because the two have similar precision (86.5 P linear chain, 85.1 P skip
chain). These results support the original hypothesis that treating repeated tokens
consistently especially benefits recall on the SPEAKER field.
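
The count of inconsistently mislabeled tokens can be sketched for one document as below.
This is an illustration, not the script behind Table 3.7: the function name is
hypothetical, "the same token" is taken to mean an identical word string, and only true
tokens of the field are counted.

```python
def inconsistently_mislabeled(words, predicted, gold, field):
    """Count true tokens of `field` that are mislabeled although the same
    word is labeled correctly somewhere else in the document."""
    correctly_labeled = {
        w for w, p, g in zip(words, predicted, gold) if g == field and p == g
    }
    return sum(
        1
        for w, p, g in zip(words, predicted, gold)
        if g == field and p != g and w in correctly_labeled
    )
```

A word that is missed everywhere in the document is not counted, since no correct
occurrence exists for a skip edge to propagate from.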

On  the  LOCATION  field,  on  the  other  hand,  where  we  might  also  expect  skip-chain 
CRFs  to  perform  better,  there  is  no  benefit.  We  explain  this  by  observing  in  Table  3.7 
that  inconsistent  misclassification  occurs  much  less  frequently  in  this  field. 

3.2.4  Related  Work 

Bunescu  and  Mooney  [14]  have  used  a  relational  Markov  network  to  collectively 
classify the mentions in a document, achieving increased accuracy by learning
dependencies between similar mentions. In their work, candidate phrases are extracted
heuristically,  which  can  introduce  errors  if  a  true  entity  is  not  selected  as  a  candidate 
phrase.  Our  model  performs  collective  segmentation  and  labeling  simultaneously,  so 
that  the  system  can  take  into  account  dependencies  between  the  two  tasks. 

After we first presented the skip-chain CRF [118], several other authors have
introduced interesting extensions. As one extension, Finkel et al. [32] augment the
skip-chain model with richer kinds of long-distance factors than just over pairs of
words. These factors are useful for modeling exceptions to the assumption that similar
words tend to have similar labels. For example, in named-entity recognition, the word
China is a place name when it appears alone, but when it occurs within the phrase
The China Daily, it should be labeled as an organization. Because this model is more
complex than the original skip-chain model, Finkel et al. estimate its parameters in
two stages, first training the linear-chain component as a separate CRF, and then
heuristically selecting parameters for the long-distance factors. Finkel et al. report
improved results both on the seminars data set that we consider in this chapter, and
on several other standard information extraction data sets.

As an alternative to the skip-chain CRF, Rosenberg et al. [104] propose an MEMM
with  long-distance  edges.  This  results  in  some  nodes  having  many  parents,  so  in 
order  to  reduce  the  number  of  parameters,  every  conditional  probability  table  that 
has  multiple  parents  is  assumed  to  be  a  mixture  of  CPTs  involving  single  parents. 
This  mixture-of-parents  assumption  is  similar  to  the  restriction  to  pairwise  factors  in 
the  skip-chain  CRF.  The  potential  advantage  of  such  a  model  is  that  training  is  much 
simpler  and  more  computationally  efficient  than  the  skip-chain  CRF.  The  potential 
disadvantage  is  label  bias,  that  is,  that  observations  late  in  the  sequence  have  no 
effect  on  earlier  labels. 

3.3  Summary 

In this chapter, I have presented two simple loopy extensions to linear-chain CRFs:
dynamic CRFs and skip-chain CRFs. Dynamic CRFs are conditionally-trained undirected
sequence models with repeated graphical structure and tied parameters. They
combine the best of both conditional random fields and the widely successful dynamic
Bayesian networks (DBNs). DCRFs address difficulties of DBNs, by easily
incorporating arbitrary overlapping input features, and of previous conditional models,
by allowing more complex dependence between labels. Inference in DCRFs can be
done using approximate methods, and training can be done by maximum a posteriori
estimation.

Empirically,  we  have  shown  that  factorial  CRFs  can  be  used  to  jointly  perform 
several  labeling  tasks  at  once,  sharing  information  between  them.  Such  a  joint  model 
performs  better  than  a  model  that  does  the  individual  labeling  tasks  sequentially,  and 
has  potentially  many  practical  implications,  because  cascaded  models  are  ubiquitous 
in  NLP. 

The skip-chain CRF segments a sequence while modeling long-distance dependencies
between similar tokens. The skip-chain CRF can also be viewed as performing
extraction  while  taking  into  account  a  simple  form  of  coreference  information,  since 
the  reason  that  identical  words  are  likely  to  have  similar  tags  is  that  they  are  likely 
to  be  coreferent.  Thus,  the  skip-chain  CRF,  like  the  FCRF,  can  be  viewed  as  a  step 
toward  joint  probabilistic  models  for  extraction  and  data  mining  as  advocated  by 
McCallum and Jensen [69].

CHAPTER 4


PIECEWISE  TRAINING 


This chapter begins our investigation of approximate methods for training CRFs.
One attractive family of approximate training methods is local training methods, by
which I mean methods that depend on sums of local functions of only a few factors,
such as the conditional probability of a node given its Markov blanket, rather than on
global functions of the entire graph, such as the likelihood. The best-known example of
a local training method is Besag’s pseudolikelihood [8], which is a product of per-node
conditional probabilities.

In this chapter, I present a novel local training method called piecewise training, in
which  the  model’s  factors  are  divided  into  possibly  overlapping  sets  of  pieces,  which 
are  each  trained  separately.  At  test  time,  the  resulting  weights  are  used  just  as  if 
they  had  been  trained  using  maximum  likelihood,  that  is,  on  the  unseen  data  they 
are  used  to  predict  the  labels  using  a  standard  approximate  inference  algorithm,  such 
as  max-product  BP.  When  using  piecewise  training,  the  modeler  must  decide  how  to 
split  the  model  into  pieces  before  training.  In  this  thesis  most  of  the  experiments  use 
the  factor-as-piece  approximation,  in  which  each  factor  of  the  model  is  placed  in  a 
separate  piece. 

This training procedure can be viewed in two ways. First, separate training of
each piece can be accomplished by numerically maximizing an approximation to the
likelihood. This approximate likelihood can be seen as the true likelihood on a
transformation of the original graph, which I call the node-split graph, in which each of
the pieces is an isolated component. The second view is based on belief propagation;
namely, the objective function of piecewise training is the same as the BP approximate
likelihood (3.6) with uniform messages, as if BP has been stopped after zero
iterations. I call this the pseudomarginal view of piecewise training, for reasons
explained in Section 4.4. These two viewpoints will prove useful both for understanding
these algorithms, and for designing the extensions in this chapter and the next.

In  this  chapter,  I  define  piecewise  training  (Section  4.1),  explaining  it  from  both 
the  local  graph  and  the  pseudomarginal  perspectives.  I  apply  the  factor-as-piece 
approximation  to  several  natural-language  data  sets.  The  model  resulting  from  the 
piecewise  approximation  has  better  accuracy  than  pseudolikelihood  and  is  sometimes 
comparable to exact maximum likelihood (Section 4.3). Then I consider several extensions
to the basic method. First, I consider an extension called reweighted piecewise
training, based on a connection between piecewise training and the upper bounds of
Wainwright et al. [140], but unfortunately the results here are negative: reweighted
piecewise training has worse accuracy than standard piecewise on our data. A more
interesting family of extensions is based on the connection to standard belief propagation.
To develop these, I introduce three views of approximate training algorithms,
which I call the neighborhood graph view, the pseudomarginal view, and the belief
view (Section 4.4). This development motivates two different extensions of piecewise
training, namely shared-unary piecewise (Section 4.5.1) and one-step cutout (Section 4.5.3).
Just as standard piecewise corresponds to zero iterations of BP, shared-unary
corresponds to 1 iteration and one-step cutout to 2 iterations. Simulated data
provides  illuminating  insight  into  when  shared-unary  piecewise  and  one-step  cutout 
may  be  more  appropriate  than  standard  piecewise  (Section  4.5.5). 

4.1  Definition 

In  this  section,  I  present  piecewise  training,  explaining  how  it  maximizes  a  loose 
lower bound on the likelihood. The motivation is that in some applications, the
local information in each factor alone, without performing inference, is enough to
do  fairly  well  at  predicting  the  outputs,  but  some  amount  of  global  information  can 
help.  Therefore,  to  reduce  training  time,  it  makes  sense  to  perform  less  inference 
at  training  time  than  at  test  time,  because  at  training  time  we  loop  through  the 
examples  repeatedly,  whereas  at  test  time  we  only  need  to  make  each  prediction 
once.  For  example,  suppose  we  want  to  train  a  loopy  pairwise  MRF.  In  piecewise 
estimation,  what  we  will  do  is  to  train  the  parameters  of  each  edge  independently, 
as if each edge were a separate two-node MRF of its own. Finally, on test data, the
parameters  resulting  from  this  local  training  become  the  parameters  used  to  perform 
global  inference,  using  some  standard  approximate  inference  algorithm. 

Now I define the piecewise estimator more generally. Let the distribution $p(y)$ be
defined by a factor graph, where $\Psi_a(y_a, \theta)$ has the exponential form (2.3), and suppose
that we wish to estimate $\theta$. (To simplify notation, I describe piecewise estimation
for generative models; the conditional case is exactly analogous.) I assume that the
model’s factors are divided into a set $\mathcal{P} = \{R_0, R_1, \ldots\}$ of pieces; each piece $R \in \mathcal{P}$
is a set of factors $R = \{\Psi_a\}$. The pieces need not be disjoint. For example, in a
grid-shaped  MRF  with  unary  and  pairwise  factors,  we  might  isolate  each  factor  in  its 
own  piece,  or  alternatively  we  might  choose  one  piece  for  each  row  and  each  column 
of  the  MRF,  in  which  case  each  unary  factor  would  be  shared  between  its  row  piece 
and  its  column  piece. 

To train the pieces separately, each piece $R$ has a local likelihood

$$\ell_R(\theta) = \sum_{a \in R} \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a) - A_R(\theta), \tag{4.1}$$

where $A_R(\theta)$ is the local log partition function for the piece, that is,

$$A_R(\theta) = \log \sum_{y_R} \exp\Big\{ \sum_{a \in R} \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a) \Big\}, \tag{4.2}$$

where $y_R$ is the vector of variables used anywhere in piece $R$. This is the likelihood
for the piece $R$ if it were a completely separate graphical model. If the pieces are
disjoint, and no parameters are shared between distinct factors, then we could train
each piece by separately computing parameters $\theta_R^{\mathrm{PW}} = \arg\max_{\theta_R} \ell_R(\theta_R)$. But in fact we
would  like  to  handle  both  parameter  tying  and  overlapping  pieces.  To  do  this,  we 
instead  perform  a  single  optimization,  maximizing  the  sum  of  all  of  the  single-piece 
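likelihoods. So for a set $\mathcal{P}$ of pieces, the piecewise likelihood becomes

$$\ell_{\mathrm{PW}}(\theta) = \sum_{R \in \mathcal{P}} \sum_{a \in R} \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a) - \sum_{R \in \mathcal{P}} A_R(\theta). \tag{4.3}$$

For example, consider the special case of per-edge pieces in a pairwise MRF with no
tied parameters. Then, for an edge $(s, t)$, we have $A_{st}(\theta) = \log \sum_{y_s, y_t} \Psi_{st}(y_s, y_t)$, so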
that  the  piecewise  estimator  corresponds  exactly  to  training  independent  probabilistic 
classifiers  on  each  edge. 
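
As a minimal sketch of this special case, the Python code below evaluates the local likelihood of a single edge piece together with its gradient. The table layout, the function names, and the toy label pairs are illustrative assumptions, not the implementation used in the experiments. The gradient of each table entry is the observed count of that label pair minus the count expected under the locally normalized edge distribution, so no inference over the rest of the graph is required.

```python
import math

def edge_local_loglik(table, pairs):
    """Local log-likelihood of one edge piece over the observed label pairs."""
    # Local log partition function A_st: normalizes over (y_s, y_t) only.
    log_z = math.log(sum(math.exp(v) for row in table for v in row))
    return sum(table[ys][yt] for ys, yt in pairs) - len(pairs) * log_z

def edge_local_gradient(table, pairs):
    """Gradient per table entry: observed count minus locally expected count."""
    z = sum(math.exp(v) for row in table for v in row)
    grad = [[-len(pairs) * math.exp(v) / z for v in row] for row in table]
    for ys, yt in pairs:
        grad[ys][yt] += 1.0
    return grad

# Hypothetical binary edge and four observed (y_s, y_t) pairs.
table = [[0.5, -0.2], [0.1, 0.9]]
pairs = [(0, 0), (1, 1), (1, 1), (1, 0)]
loglik = edge_local_loglik(table, pairs)
grad = edge_local_gradient(table, pairs)
```

Because each edge is normalized on its own, the cost of one gradient evaluation is linear in the number of edges and in the size of each edge table.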

Now  let  us  compare  the  approximate  likelihood  (4.3)  to  the  exact  likelihood.  Recall 
that  the  true  likelihood  is 


$$\ell(\theta) = \sum_{a} \sum_{k} \theta_{ak} f_{ak}(y_a) - A(\theta)$$

$$A(\theta) = \log \sum_{y} \exp\Big\{ \sum_{a} \sum_{k} \theta_{ak} f_{ak}(y_a) \Big\}.$$

Notice that the first summation of (4.3) contains exactly the same terms as in the exact
likelihood. The only difference between the piecewise objective and the exact likelihood
is in the second summation of (4.3). So $A_{\mathrm{PW}}(\theta) = \sum_R A_R(\theta)$ can be viewed as an
approximation of the log partition function.

A  choice  of  pieces  to  which  I  devote  particular  attention  is  the  factor-as-piece 
approximation,  in  which  each  factor  in  the  model  is  assigned  to  its  own  piece.  There 
is a potential ambiguity in this choice, because recall that the factors take the form

$$\Psi_a(y_a) = \exp\Big\{ \sum_k \theta_{ak} f_{ak}(y_a) \Big\},$$

so that each factor has multiple parameters and sufficient statistics. But we could
just as well place each sufficient statistic in a factor all to itself, that is,

$$\Psi_{ak}(y_a) = \exp\{ \theta_{ak} f_{ak}(y_a) \}, \tag{4.4}$$

and  define  the  pieces  at  that  level  of  granularity.  Such  a  fine-grained  choice  of  pieces 
could  be  useful.  For  example,  in  a  linear-chain  model,  we  might  choose  to  view  the 
model  as  a  weighted  finite-state  machine,  and  partition  the  state-transition  diagram 
into  pieces.  For  the  purposes  of  this  thesis,  however,  when  I  use  the  factor-as-piece 
approximation,  I  will  not  use  the  fine-grained  factorization  of  (4.4),  that  is,  I  will 
assume  that  the  graph  has  been  constructed  so  that  no  two  factors  share  exactly  the 
same  support. 

Apart  from  its  intuitive  plausibility,  another  rationale  for  the  piecewise  estimator 
is  provided  by  the  following  proposition: 

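Proposition 4.1. For any set $\mathcal{P}$ of pieces, the piecewise partition function is an
upper bound on the true partition function:

$$A(\theta) \le \sum_{R \in \mathcal{P}} A_R(\theta). \tag{4.5}$$

Proof. The bound is immediate upon expansion of $A(\theta)$:

$$\begin{aligned}
A(\theta) &= \log \sum_{y} \exp\Big\{ \sum_{a} \sum_{k} \theta_{ak} f_{ak}(y_a) \Big\} && (4.6) \\
&= \log \sum_{y} \prod_{R \in \mathcal{P}} \exp\Big\{ \sum_{a \in R} \sum_{k} \theta_{ak} f_{ak}(y_a) \Big\} && (4.7) \\
&\le \log \prod_{R \in \mathcal{P}} \sum_{y_R} \exp\Big\{ \sum_{a \in R} \sum_{k} \theta_{ak} f_{ak}(y_a) \Big\} && (4.8) \\
&= \sum_{R \in \mathcal{P}} A_R(\theta). && (4.9)
\end{aligned}$$

The bound from (4.7) to (4.8) is justified by considering the expansion of the product
in (4.8). The expansion contains every term of the summation in (4.7), and all terms
are nonnegative. □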

Therefore,  the  piecewise  likelihood  is  a  lower  bound  on  the  true  likelihood.  If  the 
graph  is  connected,  however,  then  the  bound  is  nowhere  tight. 
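
Proposition 4.1 can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical three-variable cycle with binary labels and one pairwise factor per piece; the parameter values are arbitrary and chosen only for illustration. It computes the exact log partition function by brute force and compares it with the sum of the per-edge local log partition functions.

```python
import itertools
import math

def log_partition(factors, n_vars, n_states):
    """Exact A(theta) by enumerating every joint assignment (tiny models only)."""
    total = 0.0
    for y in itertools.product(range(n_states), repeat=n_vars):
        total += math.exp(sum(t[y[s]][y[u]] for (s, u), t in factors.items()))
    return math.log(total)

def piecewise_log_partition(factors):
    """Sum of the local A_R(theta), with each pairwise factor as its own piece."""
    return sum(math.log(sum(math.exp(v) for row in t for v in row))
               for t in factors.values())

# Hypothetical 3-cycle over binary variables; keys are edges (s, u).
factors = {
    (0, 1): [[0.5, -0.2], [0.1, 0.9]],
    (1, 2): [[0.3, 0.0], [-0.4, 0.7]],
    (0, 2): [[0.2, -0.1], [0.0, 0.6]],
}
exact_A = log_partition(factors, n_vars=3, n_states=2)
piecewise_A = piecewise_log_partition(factors)
gap = piecewise_A - exact_A  # nonnegative by Proposition 4.1
```

With all parameters set to zero, the exact value is log 8 while the piecewise value is 3 log 4, which illustrates why the bound is not tight on a connected graph.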

Although so far I have been using the notation of generative models for simplicity,
piecewise estimation is especially well-suited for conditional random fields. As mentioned
earlier,  standard  maximum-likelihood  training  for  CRFs  can  require  evaluating  the 
instance-specific  partition  function  Z(x)  for  each  training  instance  for  each  iteration 
of  an  optimization  algorithm,  which  can  be  expensive  even  for  linear  chains.  By  using 
piecewise  training,  we  need  to  compute  only  local  normalization  over  small  cliques, 
which  for  loopy  graphs  is  potentially  much  more  efficient. 


4.1.1  The  Node-split  Graph 

The piecewise likelihood (4.3) can be viewed as the exact likelihood in a transformation
of the original graph. In the transformed graph, we split the variables,
adding one copy of each variable for each factor that it participates in, as pictured in
Figure 4.1. We call the transformed graph the node-split graph.

Figure 4.1. Example of node splitting. Left is the original model, right is the version
trained by piecewise. In this example, there are no unary factors.

Formally, the splitting transformation is as follows. Given a factor graph $G$, create
a new graph $G'$ with variables $\{y_{as}\}$, where $a$ ranges over all factors in $G$ and $s$ over
all variables in $a$. For any factor $a$, let $\pi_a$ map variables in $G$ to their copy in $G'$, that
is, $\pi_a(y_s) = y_{as}$ for any variable $s$ in $G$. Finally, for each factor $\Psi_a(y_a, \theta)$ in $G$, add a
factor $\Psi'_a$ to $G'$ as

$$\Psi'_a(\pi_a(y_a), \theta) = \Psi_a(y_a, \theta). \tag{4.10}$$

If  we  wish  to  use  pieces  that  are  larger  than  a  single  factor,  then  the  definition  of  the 
node-split  graph  can  be  modified  accordingly. 

Clearly,  piecewise  training  in  the  original  graph  is  equivalent  to  exact  maximum 
likelihood  training  in  the  node-split  graph.  This  view  of  piecewise  training  will  prove 
useful  in  the  extensions  presented  in  Section  4.5  and  in  Chapter  5. 
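
The splitting transformation for the factor-as-piece case can be sketched in a few lines of Python. The representation below (a dictionary from factor names to variable scopes) and the names `node_split`, `split_factors`, and `copies` are assumptions made for illustration; each copied variable is represented as the pair (factor, variable), mirroring the notation $y_{as}$.

```python
def node_split(factors):
    """Build the node-split graph when every factor is its own piece.

    factors: dict mapping a factor name to the tuple of variables in its scope.
    Returns (split_factors, copies), where split_factors maps each factor to
    its scope over copied variables and copies maps each original variable to
    the list of its copies.
    """
    split_factors = {}
    copies = {}
    for a, scope in factors.items():
        new_scope = tuple((a, s) for s in scope)  # pi_a(y_s) = y_{as}
        split_factors[a] = new_scope
        for s, copy in zip(scope, new_scope):
            copies.setdefault(s, []).append(copy)
    return split_factors, copies

# Hypothetical 3-cycle with pairwise factors a, b, c and no unary factors.
factors = {"a": ("y1", "y2"), "b": ("y2", "y3"), "c": ("y1", "y3")}
split_factors, copies = node_split(factors)
```

In the result no two factors share a variable, so every factor is an isolated component and its normalizer can be computed independently.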

4.1.2  The  Belief  Propagation  Viewpoint 

Another  way  of  understanding  piecewise  training  arises  from  belief  propagation. 
Let $M = \{m_{ai}(y_i)\}$ be a set of BP messages, not necessarily converged. We view
all of a factor’s outgoing messages as approximating it, that is, we define $\tilde{\Psi}_a = \prod_{i \in a} m_{ai}$.
Recall from Section 2.1.4 that the dual energy of belief propagation yields
an approximate partition function

$$Z_{\mathrm{BP}}(\theta, M) = \prod_a \Big( \sum_{y_a} \frac{\Psi_a(y_a)}{\tilde{\Psi}_a(y_a)} \, q(y_a) \Big) \prod_i \Big( \sum_{y_i} q(y_i) \Big)^{1 - d_i}, \tag{4.11}$$

where $d_i$ is the number of factors adjacent to variable $i$ and $q$ denotes the unnormalized beliefs

$$q(y) = \prod_a \tilde{\Psi}_a(y_a) = \prod_a \prod_{i \in a} m_{ai}(y_i), \tag{4.12}$$

with $q(y_a) = \sum_{y \setminus y_a} q(y)$ and $q(y_i) = \sum_{y \setminus y_i} q(y)$.

Now, let $M_0$ be the uniform message setting, that is, $m_{ai} = 1$ for all $a$ and $i$. This
is a common initialization for BP. Then the unnormalized beliefs are $q(y) = 1$ for all
$y$, and the approximate partition function is

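$$Z_{\mathrm{BP}}(\theta, M_0) = \prod_a \Big( C_a \sum_{y_a} \Psi_a(y_a) \Big) \prod_i C_i^{1 - d_i}, \tag{4.13}$$

where $C_a$ and $C_i$ are constants that do not depend on $\theta$. This approximate partition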
function  is  the  same  as  that  used  by  piecewise  training  with  one  factor  per  piece,  up 
to  a  multiplicative  constant  that  does  not  change  the  gradient.  So  another  view  is 
that  piecewise  training  approximates  the  likelihood  using  belief  propagation,  except 
that  we  cut  off  BP  after  0  iterations.  This  view  informs  the  training  methods  that  I 
introduce  in  Section  4.5  and  in  Chapter  6. 

4.1.3  Pseudo-Moment  Matching  Viewpoint 

Piecewise  training  is  based  on  the  intuition  that  if  all  of  the  local  factors  fit  the  data 
well,  then  the  resulting  global  distribution  is  likely  to  be  reasonable.  An  interesting 
way  of  formalizing  this  idea  is  by  way  of  the  pseudo-moment  matching  estimator  of 
Wainwright et al. [142]. In this section, I show that there is a sense in which piecewise
training  can  be  viewed  as  an  extension  of  the  pseudo-moment  matching  estimator. 

First,  consider  the  case  in  which  p(y)  factorizes  according  to  a  graph  G  with 
fully-parameterized  tables,  that  is, 

$$\Psi_a(y_a) = \exp\Big\{ \sum_{y'_a} \theta_a(y'_a) \, 1\{y_a = y'_a\} \Big\}. \tag{4.14}$$

Let $\tilde{p}(y)$ be the empirical distribution, that is, $\tilde{p}(y) \propto \sum_i 1\{y = y^{(i)}\}$.
The pseudo-moment matching estimator chooses parameters that maximize the
BP likelihood (3.6) without actually computing any of the message updates. This
estimator is

$$\theta_a(y_a) = \log \frac{\tilde{p}(y_a)}{\prod_{s \in a} \tilde{p}(y_s)}, \qquad \theta_s(y_s) = \log \tilde{p}(y_s). \tag{4.15}$$


For these parameters, there exists a set of messages that (a) is a fixed point of BP
and (b) yields beliefs $q_a$ and $q_s$ equal to the empirical marginals. This can be
seen  using  the  reparameterization  perspective  of  BP  [141]  described  in  Section  2.1.4, 
because  with  those  parameters  the  belief-based  updates  of  (2.26)  yield  a  fixed  point 
immediately. 
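
The estimator in (4.15) is simple to compute from counts. The sketch below is a hedged illustration on a hypothetical three-variable chain with binary labels; the data, the function names, and the dictionary-based parameter layout are assumptions. Because BP is exact on a tree, the model marginals can be checked against the empirical marginals by brute-force enumeration. The code assumes every label configuration needed by an edge has a nonzero count.

```python
import itertools
import math
from collections import Counter

def empirical_marginal(data, idx):
    """Empirical distribution of the sub-vector y[idx] over the training set."""
    counts = Counter(tuple(y[i] for i in idx) for y in data)
    return {k: c / len(data) for k, c in counts.items()}

def pseudo_moment_matching(data, edges, n_vars):
    """Node and edge parameters from (4.15); assumes all needed counts are nonzero."""
    p_node = [empirical_marginal(data, (s,)) for s in range(n_vars)]
    theta_node = [{k[0]: math.log(p) for k, p in p_node[s].items()}
                  for s in range(n_vars)]
    theta_edge = {}
    for s, t in edges:
        p_st = empirical_marginal(data, (s, t))
        theta_edge[(s, t)] = {
            (a, b): math.log(p / (p_node[s][(a,)] * p_node[t][(b,)]))
            for (a, b), p in p_st.items()
        }
    return theta_node, theta_edge

def model_edge_marginal(theta_node, theta_edge, edge, n_vars, n_states):
    """Exact marginal of one edge under the fitted model, by enumeration."""
    weights = {}
    for y in itertools.product(range(n_states), repeat=n_vars):
        score = sum(theta_node[s][y[s]] for s in range(n_vars))
        score += sum(th[(y[s], y[t])] for (s, t), th in theta_edge.items())
        weights[y] = math.exp(score)
    z = sum(weights.values())
    marginal = Counter()
    for y, w in weights.items():
        marginal[(y[edge[0]], y[edge[1]])] += w / z
    return dict(marginal)

# Hypothetical training set of 15 labelings on the chain 0 - 1 - 2.
data = ([(0, 0, 0)] * 3 + [(0, 0, 1)] + [(0, 1, 0)] * 2 + [(0, 1, 1)] * 2
        + [(1, 0, 0)] + [(1, 0, 1)] + [(1, 1, 0)] + [(1, 1, 1)] * 4)
edges = [(0, 1), (1, 2)]
theta_node, theta_edge = pseudo_moment_matching(data, edges, n_vars=3)
```

On a loopy graph the same parameters can be computed, but the match is then to the BP beliefs rather than to the true marginals.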

In this thesis, however, we are interested in estimating the parameters of conditional
distributions $p(y \mid x)$. A simple generalization is to require for all inputs $x$ with
$\tilde{p}(x) > 0$ that

$$\Psi_a(y_a, x) = \frac{\tilde{p}(y_a \mid x)}{\prod_{s \in a} \tilde{p}(y_s \mid x)}, \qquad \Psi_s(y_s, x) = \tilde{p}(y_s \mid x). \tag{4.16}$$


However, we can no longer expect to find parameters that satisfy these equations in
closed form. This is because the factor values of $\Psi_a(\cdot, x)$ do not have an independent
degree of freedom for each input value $x$. Instead, to promote generalization across
different inputs, the factors have some complex parameterization, such as the
linear form (2.3), in which parameters are tied across different input values. A more
useful generalization is to treat the equations (4.16) as a nonlinear set of equations
to be solved. To do this, we optimize the objective function


$$\sum_a D(\tilde{p}_a \,\|\, \Psi_a) + \sum_s D(\tilde{p}_s \,\|\, \Psi_s), \tag{4.17}$$
where $D(\cdot \,\|\, \cdot)$ is a divergence measure. By a divergence measure $D(p \,\|\, q)$, I simply
mean a nonnegative function that is 0 if and only if $p = q$. If a parameter
setting $\theta$ exists such that the divergence is zero, then the equations have been solved
exactly, and $\theta$ optimizes the BP likelihood.

This provides another view of piecewise training, because choosing $\mathrm{KL}(\tilde{p}_a \,\|\, \Psi_a)$
for  the  divergence  in  (4.17)  yields  an  equivalent  optimization  problem  to  the  piecewise 
likelihood  (4.3).  This  provides  a  justification  of  the  intuition  that  fitting  locally  can 
lead  to  a  reasonable  global  solution:  it  is  not  the  case  that  fitting  factors  locally  causes 
the  true  marginals  to  be  matched  to  the  empirical  distribution,  but  it  does  cause  the 
BP  approximation  to  the  marginals  to  be  matched  to  the  empirical  distribution. 

4.2  Reweighted  Piecewise  Training 

In this section, I sketch another proof of Proposition 4.1, deriving it from the
tree-reweighted bounds of Wainwright, Jaakkola, and Willsky [140], a connection which
suggests  generalizations  of  the  simple  piecewise  training  procedure.  To  simplify  the 
exposition,  in  this  section  I  assume  the  factor-as-piece  approximation,  but  the  ideas 
extend  readily  to  more  general  disjoint  pieces. 

4.2.1  Tree-Reweighted  Upper  Bounds 

Wainwright, Jaakkola, and Willsky [140] introduce a class of upper bounds on $A(\theta)$
that arise immediately from its convexity. The basic idea is to write the parameter
vector $\theta$ as a mixture of parameter vectors of tractable distributions, and then apply
Jensen’s inequality.

Let $\mathcal{T} = \{T_R\}$ be a set of tractable subgraphs of $\mathcal{G}$. For concreteness, think of
$\mathcal{T}$ as the set of all spanning trees of $\mathcal{G}$; this is in fact the case to which Wainwright,
Jaakkola, and Willsky devote their attention. For each tractable graph $T_R$, let $\theta(T_R)$
be an exponential parameter vector that has the same dimensionality as $\theta$, but respects
the structure of $T_R$. More formally, this means that the entries of $\theta(T_R)$ must be zero
for factors that do not appear in $T_R$. Except for this, $\theta(T_R)$ is arbitrary; there is no
requirement that on its own, it matches $\theta$ in any way.

Suppose we also have a distribution $\mu = \{\mu_R \mid T_R \in \mathcal{T}\}$ over the tractable subgraphs,
such that the original parameter vector $\theta$ can be written as a combination of
the per-tree parameter vectors:

$$\theta = \sum_{T_R \in \mathcal{T}} \mu_R \, \theta(T_R). \tag{4.18}$$

In other words, we have written the original parameters $\theta$ as a mixture of parameters
on  tractable  subgraphs. 

Then the upper bound on the log partition function $A(\theta)$ arises directly from
Jensen’s inequality:

$$A(\theta) = A\Big( \sum_{T_R \in \mathcal{T}} \mu_R \, \theta(T_R) \Big) \le \sum_{T_R \in \mathcal{T}} \mu_R \, A(\theta(T_R)). \tag{4.19}$$

Because we have required that each graph $T_R$ be tractable, each term on the right-hand
side of (4.19) can be computed efficiently. If the size of $\mathcal{T}$ is large, however,
then  computing  the  sum  is  still  intractable.  We  deal  with  this  issue  next. 

A natural question about this bound is how to select the per-tree vectors $\theta(T_R)$ so as
to get the tightest upper bound possible. For fixed $\mu$, the optimization over $\{\theta(T_R)\}$
can be cast as a convex optimization problem:

$$\min_{\{\theta(T_R)\}} \; \sum_{T_R \in \mathcal{T}} \mu_R \, A(\theta(T_R)) \tag{4.20}$$

$$\text{s.t.} \quad \theta = \sum_{T_R \in \mathcal{T}} \mu_R \, \theta(T_R). \tag{4.21}$$

But this optimization problem can have astronomically many parameters, especially
if $\mathcal{T}$ is the set of all spanning trees. The number of constraints, however, is much
smaller, because the constraints are just one equality constraint for each element of $\theta$.
To collapse the dimensionality of the optimization problem, therefore, Wainwright,
Jaakkola, and Willsky use the Lagrange dual of (4.20), which can then be optimized
using either standard optimization techniques, or a message passing algorithm similar
to BP. For our present purposes, however, it suffices to consider only the primal
problem in (4.20), which we use in the next section as an alternative derivation of
piecewise bounds.

4.2.2  Application  to  Piecewise  Upper  Bounds 

Now  we  discuss  how  the  tree-reweighted  upper  bounds  can  be  applied  to  piecewise 
training.  As  in  the  previous  section,  we  will  obtain  an  upper  bound  by  writing  the 
original parameters $\theta$ as a mixture of tractable parameter vectors $\theta(T_R)$. Consider the
set $\mathcal{T}$ of tractable subgraphs induced by single edges of $\mathcal{G}$. Precisely, for each factor
$\Psi_a$ in $\mathcal{G}$, we add a (non-spanning) tree $T_R$ which contains only the factor $\Psi_a$ and its
associated variables. With each tree $T_R$ we associate an exponential parameter vector
$\theta(T_R)$.

Let $\mu$ be a strictly positive probability distribution over factors. To use Jensen’s
inequality, we will need to have the constraint

$$\theta = \sum_R \mu_R \, \theta(T_R). \tag{4.22}$$

Now, each parameter $\theta_i$ corresponds to exactly one factor of $\mathcal{G}$, which appears in only
one of the $T_R$. Therefore, only one choice of subgraph parameter vectors $\{\theta(T_R)\}$
meets the constraint (4.22), namely:

$$\theta(T_R) = \frac{\theta|_R}{\mu_R}, \tag{4.23}$$

where $\theta|_R$ is the restriction of $\theta$ to $R$; that is, $\theta|_R$ has the same entries and dimensionality
as $\theta$, but with zeros in all entries that are not included in the piece $R$.

Method              Overall F1
Piecewise           91.2
Pseudolikelihood    84.7
Per-edge PL         89.7
Exact               90.6

Table 4.1. Comparison of piecewise training to exact and pseudolikelihood training on
a linear-chain CRF for named-entity recognition. On this tractable model, piecewise
methods are more accurate than pseudolikelihood, and just as accurate as exact
training.


Therefore, using Jensen’s inequality, we immediately have the bound

$$A(\theta) \le \sum_R \mu_R \, A\Big( \frac{\theta|_R}{\mu_R} \Big). \tag{4.24}$$

This reweighted piecewise bound is clearly related to the basic piecewise bound
in (4.5), because $A(\theta|_R)$ differs from $A_R(\theta)$ only by an additive constant which is
independent of $\theta$. In fact, a version of Proposition 4.1 can be derived by considering
the limit of (4.24) as $\mu$ approaches a point mass on an arbitrary single piece $R^*$.
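
The reweighted bound (4.24) can also be verified by enumeration on a tiny model. The sketch below assumes a hypothetical three-variable cycle with binary labels and one pairwise factor per piece; the function names and the weights `mu` are illustrative. Each term evaluates the full-graph log partition function with every factor except one removed and the remaining factor's parameters divided by its weight.

```python
import itertools
import math

def log_partition(factors, n_vars, n_states):
    """Exact A(theta) by enumerating every joint assignment (tiny models only)."""
    total = 0.0
    for y in itertools.product(range(n_states), repeat=n_vars):
        total += math.exp(sum(t[y[s]][y[u]] for (s, u), t in factors.items()))
    return math.log(total)

def reweighted_piecewise_bound(factors, mu, n_vars, n_states):
    """Right-hand side of (4.24): sum_R mu_R * A(theta restricted to R / mu_R)."""
    bound = 0.0
    for edge, table in factors.items():
        # Zeroing every other factor is the same as dropping it from the sum.
        scaled = {edge: [[v / mu[edge] for v in row] for row in table]}
        bound += mu[edge] * log_partition(scaled, n_vars, n_states)
    return bound

# Hypothetical 3-cycle over binary variables.
factors = {
    (0, 1): [[0.5, -0.2], [0.1, 0.9]],
    (1, 2): [[0.3, 0.0], [-0.4, 0.7]],
    (0, 2): [[0.2, -0.1], [0.0, 0.6]],
}
exact_A = log_partition(factors, n_vars=3, n_states=2)
uniform = {e: 1.0 / 3.0 for e in factors}
skewed = {(0, 1): 0.6, (1, 2): 0.3, (0, 2): 0.1}
bound_uniform = reweighted_piecewise_bound(factors, uniform, 3, 2)
bound_skewed = reweighted_piecewise_bound(factors, skewed, 3, 2)
```

Different choices of the weights give different upper bounds, which is what makes minimizing over them a candidate training objective.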

The  connection  to  the  Wainwright  et  al.  work  suggests  at  least  two  generalizations 
of  the  basic  piecewise  method.  The  first  is  that  the  reweighted  piecewise  bound 
in (4.24) can itself be minimized as an approximation to $A(\theta)$, yielding a variation of
the  basic  piecewise  method. 

The  second  is  that  this  line  of  analysis  can  naturally  handle  the  case  when  pieces 
overlap.  For  example,  in  an  MRF  with  both  node  and  edge  factors,  we  might  choose 
each  piece  to  be  an  edge  factor  with  its  corresponding  node  factors,  hoping  that  this 
overlap  will  allow  limited  communication  between  the  pieces  which  could  improve  the 
approximation. As long as there is a value of $\mu$ for which the constraint in (4.23)
holds,  then  (4.24)  provides  a  bound  we  can  minimize  in  an  overlapping  piecewise 
approximation. 

Method              Noun-phrase F1
Piecewise           88.1
Pseudolikelihood    84.9
Per-edge PL         86.5
BP                  86.0

Table 4.2. Comparison of piecewise training to other methods on a two-level factorial
CRF for joint part-of-speech tagging and noun-phrase segmentation.


                    Token F1
Method              location    speaker
Piecewise           87.7        75.4
Pseudolikelihood    67.1        25.5
Per-edge PL         76.9        69.3
BP                  86.6        78.2

Table 4.3. Comparison of piecewise training to other methods on a skip-chain CRF
for seminar announcements.


4.3  Experiments 

The  bound  in  (4.5)  is  not  tight.  Because  the  bound  does  not  necessarily  touch  the 
true  likelihood  at  any  point,  maximizing  it  is  not  guaranteed  to  maximize  the  true 
likelihood.  We  turn  to  experiments  to  compare  the  accuracy  of  piecewise  training  both 
to  exact  estimation,  and  to  other  approximate  estimators.  A  particularly  interesting 
comparison is to pseudolikelihood, because it is a related local estimation method.

On  three  real-world  natural  language  tasks,  we  compare  piecewise  training  to 
exact  ML  training,  to  approximate  ML  training  using  belief  propagation,  and  to 
pseudolikelihood training. To be as fair as possible, we compare to two variations
of pseudolikelihood, one based on nodes and a structured version based on edges.
Pseudolikelihood in a generative model is normally defined as [8]:

$$\ell_{\mathrm{PL}}(\theta) = \log \prod_s p(y_s \mid y_{N(s)}), \tag{4.25}$$


where $N(s)$ is the set of variables that neighbor variable $s$. This per-variable
pseudolikelihood function does not work well for sequence labeling, because it does not take
into account strong interactions between neighboring sequence positions. In order to
have a stronger baseline, we also compare to a per-edge version of pseudolikelihood:

$$\ell_{\mathrm{EPL}}(\theta) = \log \prod_{st} p(y_s, y_t \mid y_{N(s,t)}), \tag{4.26}$$

that  is,  instead  of  using  the  conditional  distribution  of  each  node,  we  use  each  edge, 
hoping  to  take  more  of  the  sequential  interactions  into  account. 
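
To make the two baselines concrete, the sketch below evaluates both objectives for a single labeling on a model small enough to enumerate. It assumes the model is available as a function `score(y)` returning the unnormalized log-probability of a labeling; that interface, the function names, and the example weights are assumptions for illustration. Conditioning on all remaining variables is equivalent to conditioning on the Markov blanket, so the code simply varies one node (or one edge) while holding everything else fixed.

```python
import itertools
import math

def log_pseudolikelihood(score, y, n_states):
    """log prod_s p(y_s | rest) for one labeling y, as in (4.25)."""
    total = 0.0
    for s in range(len(y)):
        alts = [score(y[:s] + (v,) + y[s + 1:]) for v in range(n_states)]
        total += score(y) - math.log(sum(math.exp(a) for a in alts))
    return total

def log_edge_pseudolikelihood(score, y, edges, n_states):
    """log prod_{st} p(y_s, y_t | rest) for one labeling y, as in (4.26)."""
    total = 0.0
    for s, t in edges:
        alts = []
        for vs, vt in itertools.product(range(n_states), repeat=2):
            z = list(y)
            z[s], z[t] = vs, vt
            alts.append(score(tuple(z)))
        total += score(y) - math.log(sum(math.exp(a) for a in alts))
    return total

# Hypothetical binary chain 0 - 1 - 2 that rewards equal neighboring labels.
edges = [(0, 1), (1, 2)]

def chain_score(y):
    return sum(1.5 if y[s] == y[t] else 0.0 for s, t in edges)

y = (1, 1, 0)
node_pl = log_pseudolikelihood(chain_score, y, n_states=2)
edge_pl = log_edge_pseudolikelihood(chain_score, y, edges, n_states=2)
```

Each per-node term normalizes over one variable and each per-edge term over a pair of variables, but both still multiply together every factor that touches the variables being varied.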

We  evaluate  piecewise  training  on  three  models:  a  linear-chain  CRF  (Section  2.3), 
a  factorial  CRF  (Section  3.1),  and  a  skip-chain  CRF  (Section  3.2).  All  of  these 
models  use  a  large  number  of  input  features  such  as  word  identity,  part-of-speech 
tags,  capitalization,  and  membership  in  domain-specific  lexicons. 

In all the experiments below, we optimize $\ell_{\mathrm{PW}}$ using limited-memory BFGS. We
use  a  Gaussian  prior  on  weights  to  avoid  overfitting.  In  previous  work,  the  prior 
parameter  had  been  tuned  on  each  data  set  for  belief  propagation,  and  for  the  local 
models  we  used  the  same  prior  parameter  without  change.  At  test  time,  decoding  is 
always  performed  using  max-product  belief  propagation. 

4.3.1  Linear-Chain  CRF 

First,  we  evaluate  the  accuracy  of  piecewise  training  on  a  tractable  model,  so  that 
we  can  compare  the  accuracy  to  exact  maximum-likelihood  training.  The  task  is 
named-entity  recognition,  that  is,  to  find  proper  nouns  in  text.  We  use  the  CoNLL 
2003 data set, consisting of 14,987 newswire sentences annotated with names of people,
organizations, locations, and miscellaneous entities. We test on the standard
development set of 3,466 sentences. Evaluation is done using precision and recall on
the extracted chunks, and we report $F_1 = 2PR/(P + R)$. We use a linear-chain CRF,
whose features are described in Table 4.4.
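
For reference, chunk-level $F_1$ can be computed as in the following sketch. Representing each chunk as a (start, end, type) triple and counting a prediction as correct only on an exact match are assumptions made here for illustration; the function name is hypothetical.

```python
def chunk_f1(predicted, gold):
    """F1 = 2PR / (P + R) over exactly matching chunks."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical chunks: one of two predictions matches one of three gold chunks.
predicted = [(0, 2, "PER"), (5, 6, "LOC")]
gold = [(0, 2, "PER"), (5, 7, "LOC"), (9, 10, "ORG")]
score = chunk_f1(predicted, gold)
```

Here precision is 1/2 and recall is 1/3, so the harmonic mean is 0.4.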




$w_t = w$
$w_t$ matches [A-Z][a-z]+
$w_t$ matches [A-Z][A-Z]+
$w_t$ matches [A-Z]
$w_t$ matches [A-Z]+
$w_t$ contains a dash
$w_t$ matches [A-Z]+[a-z]+[A-Z]+[a-z]
The character sequence $c_0 \ldots c_n$ is a prefix of $w_t$ (where $n \in [0, 4]$)
The character sequence $c_0 \ldots c_n$ is a suffix of $w_t$ (where $n \in [0, 4]$)
The character sequence $c_0 \ldots c_n$ occurs in $w_t$ (where $n \in [0, 4]$)
$w_t$ appears in list of first names, last names, countries, locations, honorifics, etc.
$g_k(x, t + \delta)$ for all $k$ and $\delta \in [-2, 2]$


Table 4.4. Input features $g_k(x, t)$ for the CoNLL named-entity data. In the above
$w_t$ is the word at position $t$, $T_t$ is the POS tag at position $t$, $w$ ranges over all words
in the training data, and $T$ ranges over all Penn Treebank part-of-speech tags. The
“appears  to  be”  features  are  based  on  hand-designed  regular  expressions  that  can 
span  several  tokens. 
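
A hedged sketch of how some of the features in Table 4.4 might be computed is given below. The feature names, the use of whole-token matching for the regular expressions, and the omission of the lexicon and substring features are assumptions made for brevity; this is not the feature extractor used in the experiments.

```python
import re

# Word-shape patterns from Table 4.4; whole-token matching is an assumption.
PATTERNS = {
    "init_cap": r"[A-Z][a-z]+",
    "two_plus_caps": r"[A-Z][A-Z]+",
    "single_cap": r"[A-Z]",
    "all_caps": r"[A-Z]+",
    "mixed_case": r"[A-Z]+[a-z]+[A-Z]+[a-z]",
}

def token_features(word):
    """Binary features of a single token, returned as a set of names."""
    feats = {"word=" + word}
    for name, pattern in PATTERNS.items():
        if re.fullmatch(pattern, word):
            feats.add(name)
    if "-" in word:
        feats.add("has_dash")
    for length in range(1, 6):  # c_0 ... c_n with n in [0, 4]
        if len(word) >= length:
            feats.add("prefix=" + word[:length])
            feats.add("suffix=" + word[-length:])
    return feats

def sequence_features(words, t, window=2):
    """Features at position t, including neighbors at offsets in [-window, window]."""
    feats = set()
    for d in range(-window, window + 1):
        if 0 <= t + d < len(words):
            feats |= {f"{name}@{d}" for name in token_features(words[t + d])}
    return feats

example = sequence_features(["Prof", "Sutton", "spoke"], 1)
```

Copying each feature to neighboring offsets is what lets a factor at position $t$ see a window of the input without adding any dependence between labels.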


Piecewise training performs better than either of the pseudolikelihood methods.
Even though it is a completely local training method, piecewise training performs
comparably to exact CRF training.

Now, in a linear-chain model, piecewise training has the same computational complexity
as exact CRF training, so I do not mean to advocate the piecewise approximation
for linear-chain graphs. Rather, that the piecewise approximation loses no
accuracy  on  the  linear-chain  model  is  encouraging  when  we  turn  to  loopy  models, 
which  we  do  next. 

4.3.2  Factorial  CRF 

The  first  loopy  model  we  consider  is  the  factorial  CRF  introduced  in  Section  3.1. 
As  in  Chapter  3,  we  consider  here  the  task  of  jointly  predicting  part-of-speech  tags 
and  segmenting  noun  phrases  on  the  CoNLL  2000  data  set.  We  report  results  here  on 
subsets  of  223  training  sentences,  and  the  standard  test  set  of  2012  sentences.  Results 
are averaged over 5 different random subsets. There are 45 different POS labels and
three NP labels. We report F1 on noun-phrase chunks. The features used are
described  in  Table  3.2. 

In previous work, this model was optimized by approximating the partition function
using belief optimization, but this was quite expensive. Training on the full data
set  of  8936  sentences  required  about  12  days  of  CPU  time.1 

Results on this loopy data set are presented in Table 4.2. Again, the piecewise
estimator performs better than either version of pseudolikelihood and better than
maximum-likelihood estimation using belief propagation. On this small subset,
approximate ML training with BP requires 1.8 h, but piecewise training is still twice as
fast, using 0.83 h.

4.3.3  Skip-chain  CRF 

Finally, we consider a model with many irregular loops, namely the skip-chain
model introduced in Section 3.2. This model incorporates certain long-distance
dependencies between word labels into a linear-chain model for information extraction.
We use the seminar extraction data set described in that section. Consistent with
the previous work on this data set, we use 10-fold cross validation with a 50/50
training/test split. We report per-token F1 on the speaker and location fields, the
most difficult of the four fields. The features used are described in Table 3.4. Most
documents contain many crossing skip-edges, so exact maximum-likelihood training
using junction tree is completely infeasible; instead we compare to approximate
training using loopy belief propagation.

1 This number should be taken with some skepticism, however, because it results from experiments
using hardware and JVMs from 3 years ago, which are several times slower than those of today. Also,
I have been developing the inference code continuously since then, so it is quite possible that the
speed of my implementation has improved as well.

Figure 4.2. Schematic factor-graph depiction of the difference between pseudolikelihood
(top) and piecewise training (bottom). Each term in pseudolikelihood normalizes
the product of many factors (as circled), while piecewise training normalizes over
one factor at a time.


Results on this model are given in Table 4.3. Pseudolikelihood performs
particularly poorly on this model. Piecewise estimation performs much better, but worse
than approximate training using BP.

Piecewise training is faster than loopy BP: in our implementation piecewise
training used on average 3.5 hr, while loopy BP used 6.8 hr. To get these loopy BP
results, however, we must carefully initialize the training procedure, as discussed in
Section 3.2.2. For example, if instead we initialize the model to the uniform
distribution, not only does loopy BP training take much longer, over 10 hours, but testing
performance is much worse, because the convex optimization procedure has difficulty
with noisier gradients. With uniform initialization, loopy BP does not converge for
all training instances, especially at early iterations of training. Carefully initializing
the model parameters seems to alleviate these issues, but this model-specific tweaking
was unnecessary for piecewise training.




Model                   Basic   Reweighted
Linear-chain            91.2    90.4
FCRF                    88.1    86.4
Skip-chain (location)   87.7    75.5
Skip-chain (speaker)    75.4    69.2

Table 4.5. Comparison of basic piecewise training to the reweighted piecewise bound
with uniform $\mu$.


4.3.4  Reweighted  Piecewise  Training 

We also evaluate reweighted piecewise training, a modification to the basic
piecewise estimator discussed in Section 4.2, in which the pieces are weighted by a
convex combination. The performance of reweighted piecewise training with uniform
$\mu_R$ is presented in Table 4.5. In all cases, the reweighted piecewise method performs
worse than the basic piecewise method. What seems to be happening is that in each of
these models, there are several hundred edges, so that the weight $\mu_R$ for each region
is rather small, perhaps around 0.01. For each piece $R$, the reweighted bound includes
a term $A(\theta|_R / \mu_R)$. If $\mu_R$ is around 0.01, then this means that we multiply the log
factor values by 100 before evaluating $A$. This multiplier is so extreme that the term
$A(\theta|_R / \mu_R)$ is dominated by the maximum-value weight in $\theta|_R$.
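
The domination effect described above can be seen numerically. The following is a minimal sketch, assuming that a piece is represented simply as a list of log-factor values and that $A$ is the log-sum-exp of those values; the numbers in `theta_R` are hypothetical and chosen only for illustration.

```python
import math

def log_partition(log_factor_values):
    """A(theta) = log sum_y exp(theta(y)) for one piece, computed stably."""
    m = max(log_factor_values)
    return m + math.log(sum(math.exp(v - m) for v in log_factor_values))

# Hypothetical log-factor values for one piece R (illustrative only).
theta_R = [0.5, 1.2, -0.3, 2.0]
mu_R = 0.01  # a uniform weight of the size discussed in the text

plain = log_partition(theta_R)
scaled = log_partition([v / mu_R for v in theta_R])

# Without rescaling, every entry contributes to A.
print("A(theta_R)        =", plain)
# After dividing by mu_R, the largest entry accounts for essentially all of A.
print("A(theta_R / mu_R) =", scaled)
print("max(theta_R)/mu_R =", max(theta_R) / mu_R)
```

With a multiplier of 100, the gaps between log-factor values are stretched so far that the smaller entries contribute nothing measurable to the sum.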

4.4  Three  Views  of  Approximate  Training  Algorithms 

In this section,2 we explain the subgraph and BP viewpoints on local training
algorithms. First, many local training algorithms are straightforwardly viewed as
performing exact inference on a transformed graph that cuts the global dependencies
in the model. For example, standard piecewise performs maximum-likelihood training
in a node-split graph in which variables are duplicated so that each factor is in its own
connected component. We refer to this viewpoint as the neighborhood graph view of
a training algorithm.

2 This section and the next originally appeared as the technical report Sutton and Minka [124],
as explained in the Acknowledgments at the end of the chapter.

Second, many local training algorithms can be interpreted as approximating $\log Z$
by the Bethe energy $\log Z_{BP}$ at a particular message setting. We call this the
pseudomarginal view, because under this view, the estimated parameters are chosen to match
the pseudomarginals to the empirical marginals. For any approximate partition
function $\tilde{Z}$, the pseudomarginals are the derivatives $\partial \log \tilde{Z} / \partial \theta_{ak}$. To explain the
terminology, suppose that the unary factors have the form
$\Psi_i(y_i) = \exp\{\sum_{y'} \theta_i(y') 1_{\{y' = y_i\}}\}$.
Then the derivative $\partial \log Z / \partial \theta_i(y_i)$ of the true partition function yields the marginal
distribution, so the corresponding derivative of $\log \tilde{Z}$ is called a pseudomarginal.
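
The claim that the derivative of the true $\log Z$ with respect to a unary parameter is the marginal can be checked by brute force on a tiny model. This is only a sketch under stated assumptions: the three-node binary loop, the parameter values, and the function names are all hypothetical, and exact quantities are computed by enumeration rather than by any method from the text.

```python
import itertools
import math

def joint_weights(theta_unary, theta_pair, edges):
    """Yield (assignment, unnormalized weight) for every joint assignment."""
    n = len(theta_unary)
    for y in itertools.product([0, 1], repeat=n):
        score = sum(theta_unary[i][y[i]] for i in range(n))
        score += sum(theta_pair[y[i]][y[j]] for i, j in edges)
        yield y, math.exp(score)

def log_Z(theta_unary, theta_pair, edges):
    """Exact log partition function by enumeration."""
    return math.log(sum(w for _, w in joint_weights(theta_unary, theta_pair, edges)))

def marginal(theta_unary, theta_pair, edges, i, value):
    """Exact marginal p(y_i = value) by enumeration."""
    num = den = 0.0
    for y, w in joint_weights(theta_unary, theta_pair, edges):
        den += w
        if y[i] == value:
            num += w
    return num / den

# Hypothetical parameters: theta_unary[i][y_i] and a shared pairwise table.
theta_unary = [[0.0, 0.4], [0.0, -0.7], [0.0, 1.1]]
theta_pair = [[0.5, 0.0], [0.0, 0.5]]
edges = [(0, 1), (1, 2), (0, 2)]

i, value, eps = 0, 1, 1e-6
up = [row[:] for row in theta_unary]
down = [row[:] for row in theta_unary]
up[i][value] += eps
down[i][value] -= eps

# Central finite difference of log Z in theta_i(value).
finite_difference = (log_Z(up, theta_pair, edges) - log_Z(down, theta_pair, edges)) / (2 * eps)
exact_marginal = marginal(theta_unary, theta_pair, edges, i, value)
print(finite_difference, exact_marginal)
```

Replacing `log_Z` by any approximation and repeating the finite difference gives the corresponding pseudomarginal in the sense used here.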

As a third viewpoint, any approximate inference algorithm can be used to
perform approximate ML training, by substituting the approximate beliefs for the exact
marginals in the ML gradient. We call this the belief view of an approximate
training algorithm. For example, this is the standard way of implementing approximate
training using BP. Interestingly, although every approximate likelihood yields
approximate gradients through the pseudomarginals, not all approximate gradients can
themselves be obtained as the exact gradient of any single approximate objective
function. An example of an approximate gradient that has no objective function is
the one-step cutout method (Section 4.5.3). Recently, training methods that have a
pseudomarginal interpretation (that is, those that can be described as numerically
optimizing an objective function) have received much attention, but it is not clear if
training methods that have a pseudomarginal interpretation should be preferred over
ones that do not.

The pseudomarginal and belief viewpoints are distinct. To explain this, we need
to make a distinction that is not always clear in the literature, between beliefs and
pseudomarginals. By the belief of a node $i$, we mean its normalized product of
messages, which is proportional to $q(y_i)$. By pseudomarginal, on the other hand, we
mean the derivative of the approximate log partition function $\log \tilde{Z}$ with respect to
$\theta_i$. These quantities are distinct. For example, in standard piecewise, the
pseudomarginal $\partial \log Z_{PW} / \partial \theta_i(y_i)$ equals $p_i(y_i)$, but the belief is proportional to
$q(y_i) = \sum_{\mathbf{y} \setminus y_i} q(\mathbf{y}) = 1$.

This point may be confusing for several reasons. First, when the messages $m$ are a
fixed point of BP, then the pseudomarginal always equals the belief. But this does not
hold before convergence, and we shall be mainly concerned with intermediate message
settings, before BP has converged. A second potential confusion arises because we
define $Z_{BP}$ using the dual Bethe energy [81] rather than the primal. In the primal
Bethe energy [150], the pseudomarginal equals the belief at all message settings, but
this is not true of the dual energy. We use the dual energy rather than the primal
not only because it helps in interpreting local training algorithms, but also because
at intermediate message settings it tends to be a better approximation to $\log Z$.

When calculating pseudomarginals $\partial \log \tilde{Z} / \partial \theta_i$, we must recognize that the
message setting is often itself a function of $\theta$. For example, suppose we stop BP after one
iteration, that is, we take $\tilde{Z}(\theta) = Z_{BP}(\theta, m^{(1)}(\theta))$, where $m^{(1)}$ are the messages after
one BP iteration. Then, because the message setting is clearly a function of $\theta$, we
need to take $\partial m^{(1)} / \partial \theta_i$ into account when computing the pseudomarginals of $\tilde{Z}$.
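
Written out, this dependence is just the chain rule. The display below is a sketch rather than a formula from the text: the index $k$, which ranges over the individual entries of the message vector, is notation assumed here for illustration.

```latex
\frac{d \log \tilde{Z}(\theta)}{d \theta_i}
  = \left. \frac{\partial \log Z_{BP}(\theta, m)}{\partial \theta_i} \right|_{m = m^{(1)}(\theta)}
  + \sum_k \left. \frac{\partial \log Z_{BP}(\theta, m)}{\partial m_k} \right|_{m = m^{(1)}(\theta)}
    \frac{\partial m^{(1)}_k(\theta)}{\partial \theta_i}
```

The first term treats the messages as constants. The second term vanishes whenever the message setting does not depend on $\theta$ (as with the uniform messages of standard piecewise) or whenever $\log Z_{BP}$ is stationary in $m$ at the chosen setting.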

4.5  Extensions  from  BP  Early  Stopping 

If piecewise training can be seen as running zero iterations of BP, it is natural
to ask whether one or two iterations of BP might perform better. This leads to the
algorithms that we explore in this section, namely shared-unary piecewise and
one-step cutout. In this section, I assume that all factors are in weakly canonical form, by
which I mean that every variable $i$ has a unary factor $\Psi_i(y_i, x, \theta)$, and for every
nonunary factor $a$ and variable $i \in a$, the sum $\sum_{y_a \setminus y_i} \Psi_a(y_a, x, \theta)$ is uniform over $y_i$. For
fixed $x$, this transformation is always possible, and it means intuitively that none of
the unary information is hidden within higher-way factors. Note that this is a weaker
condition than the canonical form used in the Hammersley-Clifford theorem [7], in
that we allow $k$-way factors ($k > 2$) to contain $(k - 1)$-way information; for example,
a three-way factor may contain two-way information. I also assume that the unary
factors are parameterized as

$$\Psi_i(y_i, x, \theta) = \exp\{\theta_i(y_i)\}.$$    (4.27)
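
For a strictly positive pairwise factor, one way to reach weakly canonical form is alternating row and column rescaling, with the scale factors pushed into the neighboring unary factors. This Sinkhorn-style procedure is an illustration chosen here, not a method stated in the text, and the function name and example table are hypothetical.

```python
def weakly_canonicalize(psi, iters=200):
    """Split a positive table psi[y_i][y_j] as r[y_i] * core[y_i][y_j] * c[y_j].

    After convergence every row of `core` has the same sum and every column has
    the same sum, so `core` carries no unary information. The vectors r and c
    would be multiplied into the unary factors of variables i and j.
    """
    n, m = len(psi), len(psi[0])
    core = [row[:] for row in psi]
    r = [1.0] * n
    c = [1.0] * m
    for _ in range(iters):
        # Rescale each row to sum to 1 and record the scale in r.
        for a in range(n):
            s = sum(core[a])
            r[a] *= s
            core[a] = [v / s for v in core[a]]
        # Rescale each column to sum to n/m and record the scale in c.
        for b in range(m):
            s = sum(core[a][b] for a in range(n)) * m / n
            c[b] *= s
            for a in range(n):
                core[a][b] /= s
    return r, core, c

# Hypothetical pairwise factor with unary information hidden inside it.
psi = [[2.0, 0.5], [1.0, 3.0]]
r, core, c = weakly_canonicalize(psi)
row_sums = [sum(row) for row in core]
col_sums = [sum(core[a][b] for a in range(2)) for b in range(2)]
print(row_sums, col_sums)
```

The product `r[a] * core[a][b] * c[b]` reproduces the original table at every step, so the joint distribution is unchanged once `r` and `c` are absorbed into the unaries.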


4.5.1  Shared-Unary  Piecewise 

One idea for improving the piecewise estimate of the unary gradient is to duplicate
the unary factors over every non-unary piece that they neighbor. This yields a new
approximate training method, which I call shared-unary piecewise. Recall that the
standard piecewise $Z_{PW}$ arises from $Z_{BP}$ when all $\tilde{\Psi}_a = 1$. In shared-unary piecewise,
rather than approximating the unary factors by $\tilde{\Psi}_i = 1$, we incorporate them exactly,
that is, we take $\tilde{\Psi}_i(y_i) = \Psi_i(y_i, x, \theta)$ for the unary factors, and $\tilde{\Psi}_a = 1$ otherwise.
These messages are the result of one parallel BP iteration from uniform messages, so
I call them parallel BP(1) messages. These messages yield the approximate partition
function:

$$Z_{PWU} = \prod_{a \in NU} \Bigl( \sum_{y_a} \Psi_a(y_a, x, \theta) \prod_{i \in a} \Psi_i(y_i, x, \theta) \Bigr) \prod_i \Bigl( \sum_{y_i} \Psi_i(y_i, x, \theta) \Bigr)^{2 - d_i}$$    (4.28)


where $NU$ is the set of nonunary factors in the model. We can distribute terms in
$Z_{PWU}$ to yield a form similar to standard piecewise. First, we define a normalized
version of the unary factors as

$$p_i(y_i) = \frac{\Psi_i(y_i, x, \theta)}{\sum_{y'} \Psi_i(y', x, \theta)}.$$    (4.29)

Then we can distribute terms in (4.28) to obtain

$$Z_{PWU} = \prod_{a \in NU} \Bigl( \sum_{y_a} \Psi_a(y_a, x, \theta) \prod_{i \in N(a)} p_i(y_i) \Bigr) \prod_i \Bigl( \sum_{y_i} \Psi_i(y_i, x, \theta) \Bigr).$$    (4.30)

So shared-unary piecewise is the same as regular piecewise, except that we share
normalized unary factors among all of the higher-way pieces that they neighbor. By
normalizing the unary factors before spreading them across all the pieces, intuitively
we avoid overcounting their sum.

Iterations   Message setting
0            $\tilde{\Psi}_a = 1$ for all $a$
1            $\tilde{\Psi}_a = 1$ for nonunary $a$; $\tilde{\Psi}_i = \Psi_i$ for variables $i$
2            $\tilde{\Psi}_a(y_a) = \prod_{i \in a} m_{ai}(y_i)$, where
             $m_{ai}(y_i) = \sum_{y_a \setminus y_i} \Psi_a(y_a, x, \theta) \prod_{j \in N(a) \setminus i} \Psi_j(y_j, x, \theta)$

Table 4.6. Message settings after zero, one, and two parallel iterations of BP. Recall
that the nonunary factors are assumed to be weakly canonical.
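
The step from (4.28) to (4.30) can be checked numerically. The sketch below assumes binary variables, a single pairwise table shared by all edges, and the convention that $d_i$ counts the unary factor of variable $i$ together with every nonunary factor touching it; that convention, the function names, and the example values are assumptions made here so that the two forms agree.

```python
import math

def log_Z_pwu_distributed(unary, pair, edges):
    """log Z_PWU in the distributed form (4.30), using normalized unaries p_i."""
    sums = [sum(u) for u in unary]
    p = [[v / s for v in u] for u, s in zip(unary, sums)]
    log_z = sum(math.log(s) for s in sums)
    for i, j in edges:
        piece = sum(pair[a][b] * p[i][a] * p[j][b] for a in range(2) for b in range(2))
        log_z += math.log(piece)
    return log_z

def log_Z_pwu_raw(unary, pair, edges):
    """log Z_PWU in the undistributed form (4.28), using unnormalized unaries."""
    sums = [sum(u) for u in unary]
    # Assumed convention: d_i = 1 (the unary) + number of pairwise factors touching i.
    degree = [1 + sum(i in e for e in edges) for i in range(len(unary))]
    log_z = sum((2 - d) * math.log(s) for d, s in zip(degree, sums))
    for i, j in edges:
        piece = sum(pair[a][b] * unary[i][a] * unary[j][b] for a in range(2) for b in range(2))
        log_z += math.log(piece)
    return log_z

# Hypothetical three-variable chain with a Potts-style pairwise table.
unary = [[1.0, 0.3], [1.0, 2.5], [1.0, 0.8]]
pair = [[1.0, 0.2], [0.2, 1.0]]
edges = [(0, 1), (1, 2)]

distributed = log_Z_pwu_distributed(unary, pair, edges)
raw = log_Z_pwu_raw(unary, pair, edges)
print(distributed, raw)
```

Each nonunary piece borrows one copy of the unary sum per variable it touches, which is exactly what the exponent in (4.28) repays.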


4.5.2  Shared-Unary  Pseudomarginals 

In this section, we derive the pseudomarginals for shared-unary piecewise. Taking
the derivative of $\log Z_{PWU}$ yields

$$\frac{\partial \log Z_{PWU}}{\partial \theta_i(y')} = p_i(y') + \sum_{a \in NU(i)} \frac{\sum_{y_a} \Psi_a(y_a, x, \theta) \bigl( \prod_{j \in N(a) \setminus i} p_j(y_j) \bigr) \cdot \frac{\partial p_i(y_i)}{\partial \theta_i(y')}}{\sum_{y_a} \Psi_a(y_a, x, \theta) \bigl( \prod_{j \in N(a)} p_j(y_j) \bigr)}$$    (4.31)

where $NU(i)$ is the set of all nonunary factors that neighbor variable $i$. Then substituting
in

$$\frac{\partial p_i(y_i)}{\partial \theta_i(y')} = p_i(y_i) \bigl[ 1_{\{y_i = y'\}} - p_i(y') \bigr]$$    (4.32)

yields

$$\frac{\partial \log Z_{PWU}}{\partial \theta_i(y')} = p_i(y') + \sum_{a \in NU(i)} \Biggl[ \frac{\sum_{y_a \setminus y_i} \Psi_a(y_a, x, \theta) \bigl( \prod_{j \in N(a)} p_j(y_j) \bigr) \big|_{y_i = y'}}{\sum_{y_a} \Psi_a(y_a, x, \theta) \bigl( \prod_{j \in N(a)} p_j(y_j) \bigr)} - p_i(y') \Biggr]$$    (4.33)

$$= p_i(y') \Bigl[ 1 + \sum_{a \in NU(i)} \bigl( C_a^{-1} m^{(2)}_{ai}(y') - 1 \bigr) \Bigr]$$    (4.34)

where $m^{(2)}_{ai}$ are the unnormalized BP messages after two parallel updates. We
introduce the notation $C_a$ to represent the denominator in (4.33), which is not the
normalizing constant of $m^{(2)}_{ai}$.

Standard piecewise
  Neighborhood graph    node-split graph (Section 4.1.1)
  Pseudomarginal view   Messages at zero iterations
  Belief view           $b_a(y_a) \propto \Psi_a(y_a, x, \theta)$

Shared-unary piecewise
  Neighborhood graph    node-split graph with pieces $\Psi_i$ for each variable $i$,
                        and $\{\Psi_a\} \cup \{p_i \mid i \in a\}$ for each nonunary $a$
  Pseudomarginal view   Messages after one parallel iteration
  Belief view           Summation-hack approximation to BP beliefs after
                        two parallel iterations

Cutout (one-step)
  Neighborhood graph    node-split graph with pieces $G_a$ for each factor $a$,
                        where $G_a$ is defined as in (4.37)
  Pseudomarginal view   none
  Belief view           BP beliefs after two parallel iterations

Table 4.7. Viewpoints on local training algorithms discussed in this note. For each
method, “Neighborhood graph” means the graph on which the method can be viewed
as performing exact maximum likelihood training. “Pseudomarginal view” lists the
message setting with which the method approximates $\log Z$ by $\log Z_{BP}$. “Belief view”
gives the beliefs with which the method can be viewed as approximating the gradient
of the true likelihood. For reference, the BP messages after zero, one, and two parallel
iterations are given in Table 4.6.
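
The shared-unary pseudomarginal can be sanity-checked against a finite difference of $\log Z_{PWU}$. The sketch below assumes binary variables, the parameterization $\Psi_i(y_i) = \exp\{\theta_i(y_i)\}$, and one pairwise table shared by all edges; the model, parameter values, and function names are hypothetical. The analytic side is the unary term $p_i(y')$ plus, for each neighboring nonunary factor, the ratio of the restricted sum to the full sum minus $p_i(y')$.

```python
import math

def normalized_unary(theta_i):
    """Return p_i and the unary normalizer sum_y exp(theta_i(y))."""
    weights = [math.exp(t) for t in theta_i]
    total = sum(weights)
    return [w / total for w in weights], total

def log_Z_pwu(theta, pair, edges):
    """log Z_PWU: unary normalizers plus pieces weighted by normalized unaries."""
    log_z, p = 0.0, []
    for theta_i in theta:
        p_i, total = normalized_unary(theta_i)
        p.append(p_i)
        log_z += math.log(total)
    for i, j in edges:
        log_z += math.log(sum(pair[a][b] * p[i][a] * p[j][b]
                              for a in range(2) for b in range(2)))
    return log_z

def pseudomarginal(theta, pair, edges, i, value):
    """Analytic derivative of log Z_PWU with respect to theta_i(value)."""
    p = [normalized_unary(t)[0] for t in theta]
    result = p[i][value]
    for first, second in edges:
        if i not in (first, second):
            continue
        if first == i:
            j, table = second, pair
        else:
            # Transpose so that the first index of `table` is always y_i.
            j, table = first, [list(col) for col in zip(*pair)]
        c_a = sum(table[a][b] * p[i][a] * p[j][b] for a in range(2) for b in range(2))
        restricted = sum(table[value][b] * p[i][value] * p[j][b] for b in range(2))
        result += restricted / c_a - p[i][value]
    return result

# Hypothetical three-node loop with a Potts-style table.
theta = [[0.0, -1.2], [0.3, 0.0], [0.0, 0.8]]
pair = [[1.0, math.exp(-1.5)], [math.exp(-1.5), 1.0]]
edges = [(0, 1), (1, 2), (0, 2)]

i, value, eps = 0, 1, 1e-6
up = [row[:] for row in theta]
down = [row[:] for row in theta]
up[i][value] += eps
down[i][value] -= eps
finite_difference = (log_Z_pwu(up, pair, edges) - log_Z_pwu(down, pair, edges)) / (2 * eps)
analytic = pseudomarginal(theta, pair, edges, i, value)
print(finite_difference, analytic)
```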


4.5.3  Cutout  Method 

The cutout method approximates the true gradient by performing exact inference
on a subgraph. Each parameter is assigned its own subgraph, but the subgraphs
are allowed to overlap. Given a subgraph $G_a$ for each factor $a$, we define the subgraph
likelihood $\ell_a$ as the exact likelihood over the graph $G_a$. Let $A$ be the set of factors in
$G_a$ and $F_a = \prod_{b \in A} \Psi_b$. Then the subgraph likelihood can be written

$$\ell_a(\theta_a) = \frac{F_a(y_a, x, \theta)}{\sum_{y'_a} F_a(y'_a, x, \theta)} = \frac{F_a(y_a, x, \theta)}{Z_a}.$$    (4.35)

Then the parameter vector $\theta$ is selected to solve the system of equations $\partial \ell_a / \partial \theta_a = 0$
for all $a$. In general, there does not exist a single objective function $\ell(\theta)$ whose
partial derivatives match all of the $\partial \ell_a / \partial \theta_a$, because the vector field defined by
$\partial \ell_a / \partial \theta_a$ has nonzero curl. In two dimensions, the curl of a vector field
$\mathbf{F}(x_1, x_2) = [f_1(x_1, x_2) \;\; f_2(x_1, x_2)]$ is given by

$$\mathrm{curl}\, \mathbf{F} = \frac{\partial f_2}{\partial x_1} - \frac{\partial f_1}{\partial x_2}.$$    (4.36)

It is a standard theorem of vector calculus that a continuously differentiable vector field over
$\mathbb{R}^n$ is the gradient of a function if and only if it has zero curl. To see this, observe that
if $\mathbf{F}$ has nonzero curl, it is the gradient of a function $f$ only if
$\partial^2 f / \partial x_1 \partial x_2 \neq \partial^2 f / \partial x_2 \partial x_1$,
which is impossible. The cutout vector field has nonzero curl essentially because each
$\theta_a$ is used in many $\ell_b$, but only one $\ell_a$ is used to compute its approximate gradient.
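
The curl criterion is easy to probe numerically in two dimensions. The sketch below uses central finite differences on two toy vector fields chosen for illustration; neither is the cutout vector field itself.

```python
def curl_2d(field, x1, x2, h=1e-5):
    """Central-difference estimate of d f2 / d x1 - d f1 / d x2 at (x1, x2)."""
    df2_dx1 = (field(x1 + h, x2)[1] - field(x1 - h, x2)[1]) / (2 * h)
    df1_dx2 = (field(x1, x2 + h)[0] - field(x1, x2 - h)[0]) / (2 * h)
    return df2_dx1 - df1_dx2

def gradient_field(x1, x2):
    # Gradient of f(x1, x2) = x1**2 * x2, so an objective function exists.
    return (2 * x1 * x2, x1 ** 2)

def rotational_field(x1, x2):
    # Not the gradient of any function: its curl is 2 everywhere.
    return (-x2, x1)

curl_of_gradient = curl_2d(gradient_field, 0.7, -1.3)
curl_of_rotation = curl_2d(rotational_field, 0.7, -1.3)
print(curl_of_gradient, curl_of_rotation)
```

A training rule whose update directions behave like `rotational_field` can still have a well-defined solution to its stationarity equations, but there is no scalar objective whose value a line search could monitor.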

Here I focus on a special case of the cutout method, the one-step cutout method.
In one-step cutout, we choose $G_a$ to be all of the neighboring factors of $\Psi_a$, plus their
unaries. That is, $G_a$ is a factor graph with factors

$$A = \{ \Psi_b \mid \text{factor } \Psi_b \text{ is distance 2 or less from factor } \Psi_a \}.$$    (4.37)

(When counting distance between factors, we do not count variables, so that a path
$a - i - b$ in the factor graph counts as one step.) In many situations, the cutout graph
$G_a$ is a tree, even when the original graph $G$ is not, for example when $G$ is a grid. If
$G_a$ is a tree, then we can compute $\partial \ell_a / \partial \theta_a$ exactly using two parallel iterations of BP
on the original graph $G$. To see this, observe that because $G_a$ is a tree of diameter
4, we can exactly compute $Z_a$ by performing two parallel BP iterations on $G_a$. But
the two-iteration messages on $G_a$ are the same as the two-iteration messages on the
original graph, which are given in Table 4.6.
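
The factor set in (4.37) can be collected with a breadth-first search over factors. The sketch below is one possible encoding, assuming a factor graph stored as a dictionary from hypothetical factor names to the variables each factor touches; two factors are one step apart when they share a variable.

```python
from collections import deque

def cutout_factors(factors, center, max_distance=2):
    """Return the names of all factors within max_distance steps of `center`.

    `factors` maps a factor name to the tuple of variables it touches. Variables
    are not counted as steps, so a path a - i - b has length one.
    """
    distance = {center: 0}
    queue = deque([center])
    while queue:
        a = queue.popleft()
        if distance[a] == max_distance:
            continue
        for b, scope in factors.items():
            if b not in distance and set(scope) & set(factors[a]):
                distance[b] = distance[a] + 1
                queue.append(b)
    return set(distance)

# Hypothetical linear chain over variables 0..4 with unary and pairwise factors.
factors = {
    "u0": (0,), "u1": (1,), "u2": (2,), "u3": (3,), "u4": (4,),
    "p01": (0, 1), "p12": (1, 2), "p23": (2, 3), "p34": (3, 4),
}
subgraph = cutout_factors(factors, "p12")
print(sorted(subgraph))
```

Read literally, the distance rule also admits nonunary factors two steps away, such as `p34` here; under weak canonical form such a factor sums to a constant over its far variable, so it does not change the marginals on the rest of the cutout graph.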

The beliefs at the parallel BP(2) message setting are

$$b^{(2)}_i(y_i) \propto \Psi_i(y_i, x, \theta) \prod_{a \in NU(i)} m^{(2)}_{ai}(y_i).$$    (4.38)

So from the belief viewpoint, one-step cutout approximates the ML gradient by
substituting the beliefs (4.38) for the marginal probabilities.

4.5.4  Shared  Unary  as  an  Approximation  to  Cutout 

Shared-unary piecewise can be viewed as an approximation to the cutout method.
A general way of approximating a product of terms is the “summation hack”:

$$\prod_i e_i \approx 1 + \sum_i (e_i - 1),$$    (4.39)

where the approximation arises from a first-order Taylor expansion around $e = 1$.
Applying the summation hack to the one-step cutout beliefs (4.38), we obtain

$$b^{(2)}_i(y_i) \propto \Psi_i(y_i, x, \theta) \prod_{a \in NU(i)} C_a^{-1} m^{(2)}_{ai}(y_i)$$    (4.40)

$$\approx \Psi_i(y_i, x, \theta) \Bigl[ 1 + \sum_{a \in NU(i)} \bigl( C_a^{-1} m^{(2)}_{ai}(y_i) - 1 \bigr) \Bigr],$$    (4.41)

which are the same as the pseudomarginals (4.34) of $Z_{PWU}$, up to the
proportionality constant of $p_i$. So we can view the shared-unary pseudomarginals either as the
pseudomarginals after one BP iteration, or as an approximation to the beliefs after
two iterations. This leads us to expect two sources of error in shared-unary
piecewise: error may arise either from the summation hack, or because the model has
long-distance interactions that cannot be propagated in two parallel BP iterations.
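
A first-order expansion of a product around the point where every term equals one is exact when at most one term differs from one, and degrades as several terms move away together. The sketch below illustrates this with hypothetical numbers; the function names are chosen here.

```python
import math

def product(terms):
    """Exact product of the terms."""
    return math.prod(terms)

def summation_hack(terms):
    """First-order Taylor expansion of the product around all terms equal to 1."""
    return 1.0 + sum(t - 1.0 for t in terms)

one_strong = [1.0, 3.0]      # only one term far from 1: the expansion is exact
both_strong = [2.0, 3.0]     # two terms far from 1: the cross term is lost
both_weak = [1.1, 0.95]      # both near 1: the error is second order

for terms in (one_strong, both_strong, both_weak):
    print(terms, product(terms), summation_hack(terms))
```

In terms of incoming messages, a term equal to one corresponds to a message that carries no information, which is why the expansion is reliable when at most one incoming message is strong.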


4.5.5  Simulations 

These intuitions can be validated by simulation on a simple network. The data
is generated from a three-node network of binary variables with pairwise factors

$$\Psi_a(y_a) = \begin{bmatrix} 1 & e^{-s} \\ e^{-s} & 1 \end{bmatrix}$$    (4.42)

and unary factors $\Psi_i(y_i) = [1 \;\; e^{-u}]$. We transform the pairwise factors into a
three-variable exclusive-or factor times a unary factor, so that from the perspective of the
learning algorithm, all the factors are unary. We focus on how the approximations
to $\log Z$ and its derivatives change as a function of the model parameters. This
is useful to study because the log likelihood equals $\log Z$ plus a linear function of
the parameters, so examining $\log Z$ alone gives insight into how the approximation
performs for any data set [127].

Figure 4.3. Comparison of piecewise and shared-unary piecewise approximations as
a function of the equality strength $s$. Top, approximation to $\log Z$; bottom,
approximation to its derivative.

Figure 4.4. Approximation to $\log Z$ by piecewise and shared-unary piecewise as a
function of the unary strength $u$. Top, approximation to $\log Z$; bottom,
approximation to its derivative.

Figure 4.5. Absolute gradient error of standard piecewise as a function of the unary
parameter and the sum of its incoming message strengths.

Figure 4.6. Difference in absolute gradient error between shared-unary piecewise
and standard piecewise.

Figure 4.7. Difference in gradient error between shared-unary and standard
piecewise. At left, when the magnitude of $\partial p(y_1) / \partial u_3$ is small, then shared-unary is
superior. At right, where this derivative is large, shared-unary and standard piecewise
are equivalent.

Figure 4.8. Difference in gradient error between cutout method and shared-unary
piecewise as a function of the message strengths (left). At right, contours of the error
of the “summation hack” Taylor expansion.
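
A simulation of this kind can be outlined in a few lines. The sketch below assumes a Potts-style pairwise table with ones on the diagonal and $e^{-s}$ off the diagonal, tied unary factors $[1 \;\; e^{-u}]$, and a fully connected three-node network; it works with the pairwise factors directly rather than with the exclusive-or transformation, and all function names are hypothetical.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2), (0, 2)]

def tables(s, u):
    """Pairwise and unary tables under the assumed parameterization."""
    pair = [[1.0, math.exp(-s)], [math.exp(-s), 1.0]]
    unary = [1.0, math.exp(-u)]
    return pair, unary

def exact_log_Z(s, u):
    """Exact log Z by enumerating all eight joint assignments."""
    pair, unary = tables(s, u)
    total = 0.0
    for y in itertools.product(range(2), repeat=3):
        w = 1.0
        for i in range(3):
            w *= unary[y[i]]
        for i, j in EDGES:
            w *= pair[y[i]][y[j]]
        total += w
    return math.log(total)

def piecewise_log_Z(s, u):
    """Standard piecewise: every factor is normalized on its own."""
    pair, unary = tables(s, u)
    return 3 * math.log(sum(unary)) + len(EDGES) * math.log(sum(map(sum, pair)))

def shared_unary_log_Z(s, u):
    """Shared-unary piecewise: pairwise pieces are weighted by normalized unaries."""
    pair, unary = tables(s, u)
    z_i = sum(unary)
    p = [v / z_i for v in unary]
    piece = sum(pair[a][b] * p[a] * p[b] for a in range(2) for b in range(2))
    return 3 * math.log(z_i) + len(EDGES) * math.log(piece)

u = -math.log(0.2)  # unary strength with exp(-u) = 0.2
for s in (0.0, 0.5, 1.0, 2.0):
    print(s, exact_log_Z(s, u), piecewise_log_Z(s, u), shared_unary_log_Z(s, u))
```

At $s = 0$ the pairwise tables are all ones, so the shared-unary value coincides with the exact one, while standard piecewise is off by a constant that does not depend on the unary parameters.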

First, we look at single-dimensional plots of the approximate gradients, in which
all of the unary parameters are tied, and we vary either the unary strength $u$ or
the equality strength $s$. When we vary the equality strength for a fixed, strong unary
strength of $e^{-u} = 0.2$, shared-unary piecewise provides a much better
approximation to $\log Z$ as a function of the equality strength $s$ (Figure 4.3). As $s$ approaches 0,
the pairwise factors drop out, so that both the piecewise approximations are exact. In
both Figures 4.3 and 4.4, we subtract $\log Z(0)$ from $\log Z_{PW}$. Without that correction,
$Z_{PW}$ is an upper bound but a strong overestimate. Also, in both figures, the plotted
derivative is the negative of the pseudomarginal, because of the parameterization we
use.

When we vary the unary strengths, on the other hand, shared unary has less
desirable behavior (Figure 4.4). This figure shows the approximations to $\log Z$ and its
derivative for a fixed equality strength $e^{-s} = 0.2$. Of course, as $u$ approaches 0, the
unary factors drop out, so that shared unary becomes equivalent to standard
piecewise. Elsewhere, however, we see that shared-unary piecewise is no longer convex,
because of the per-node normalization, and in fact we see a large regime where $Z_{PWU}$
is increasing while the true $Z$ is decreasing. Consequently, the derivative of $Z_{PWU}$
crosses zero at several points when the exact objective does not, which is undesirable
in an approximation. In other words, the shared-unary piecewise pseudomarginal is
sometimes negative.

We can get a more precise sense of when the piecewise approximations break down
by examining their approximation error as a function of the incoming messages from
the rest of the network. To do this, we use a four-node Potts network of the form
above, which is easier to interpret because there are no odd-length cycles. Also, it is
helpful to leave the unary parameters untied, so that the model has five parameters
$[s, u_1, u_2, u_3, u_4]$. We generate models by sampling uniformly over all parameter values
in the range $[-4, 4]$. We measure the error in the pseudomarginal of $y_1$ for both
standard and shared-unary piecewise. We plot the error in the pseudomarginal as a
function of the message strengths $m_{21}$ and $m_{41}$ of the incoming messages to $y_1$ from
its neighbors $y_2$ and $y_4$. By message strength, we mean the log ratio of the message
value at 0 over the message value at 1.

For standard piecewise (Figure 4.5), we see as expected that the approximation
error is greatest when the incoming messages are both strong and in disagreement
with the local unary parameter. For shared unary, on the other hand, we report the
difference in gradient error between shared-unary and standard piecewise (Figure 4.6).
First, shared unary improves greatly over standard piecewise in the areas where
piecewise performs worst, that is, when the incoming messages disagree strongly with the
local unary. But shared unary is not always better than standard piecewise, especially
when the messages are weak. This may be surprising because shared-unary performs
an extra iteration of BP. However, the BP view of shared unary suggests two possible
sources of error. First, one BP iteration can actually be worse than zero iterations,
if nearby potentials contradict stronger factors elsewhere in the network. Second,
shared unary may be a bad approximation to the two-iteration BP beliefs because of
the summation hack.

We can isolate these two sources of error. First, to see where one iteration of BP
might actually hurt, we look at the derivative of $p(y_1)$, the exact marginal probability
of variable $y_1$, taken with respect to $u_3$, the unary parameter opposite $y_1$ in the graph.
If this derivative has large magnitude, then we expect that the parameter $u_3$ has a
large impact on the marginal $p(y_1)$, so that one iteration can make the beliefs worse
if $y_2$ and $y_4$ have an effect in the opposite direction from $y_3$. In Figure 4.7, we show
the error difference between shared-unary and standard piecewise as a function of
$|\partial p(y_1) / \partial u_3|$. As this argument predicts, when this derivative has large magnitude,
then the information from $u_3$, which neither method considers, is most important, so
that neither method dominates. When this derivative has small magnitude, then the
local information is more important, so shared-unary piecewise dominates.

Second, we can examine the effects of the summation hack by plotting the
difference in gradient error between shared unary and cutout. The summation hack is
accurate near the axes, so if both messages are strong, we expect shared unary to
have high error. In Figure 4.8 it can be seen that shared unary performs worse than
one-step cutout at exactly the places where the summation hack predicts.

An interesting observation here is that the cutout method is itself not always
better than shared-unary piecewise, even though cutout performs an extra iteration
of BP. This happens in cases when $u_3$ has large magnitude, so neither method can
do well. In these cases, it can happen that the summation hack error pushes the
shared-unary marginal in the direction of the correct marginal, so that shared unary
performs better.

In summary, we have seen that in many situations, shared-unary piecewise provides
a better approximation to $\log Z$ and its derivatives than standard piecewise. The
occasions when standard piecewise performs better than shared unary occur when
there is strong influence from outside the pieces, in which case neither piecewise
method is probably advisable. We have also demonstrated in simulation two potential
sources of error in shared-unary piecewise: strong interactions from outside the piece,
and the summation hack. A potentially serious drawback to shared-unary piecewise,
however, is that it is not convex in the parameters. Indeed, an important limitation of these
simulations is that we have looked only at error in the gradient, not error in the
optimal parameter settings, so we cannot assess to what extent the loss of convexity
makes it harder to find good parameters.

4.6  Related  Work

Because the piecewise estimator is such an intuitively appealing method, it has
been used in several scattered places in the literature, for tasks such as
information extraction [147], collective classification [44], and computer vision [34]. In these
papers, the piecewise method is reported as a successful heuristic for training large
models, but its performance is not compared against other training methods. We are
unaware of previous work systematically studying this procedure in its own right.

As mentioned earlier, the most closely related procedure that has been studied
statistically is pseudolikelihood [8, 9]. The main difference is that piecewise training
does not condition on neighboring nodes, but ignores them altogether during training.
This is depicted schematically by the factor graphs in Figure 4.2. In pseudolikelihood,
each locally normalized term for a variable or edge includes contributions
from a number of factors that connect to the neighbors, whose observed
values are taken from labeled training data. All these factors are circled in the top
section of Figure 4.2. In piecewise training, each factor becomes an independent,
locally normalized term in the objective function.
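
The contrast between the two objectives can be made concrete on a toy model. The sketch below assumes a three-variable binary chain with hypothetical factor tables and observed labels; it evaluates both objectives at fixed parameters rather than optimizing them, and it uses the per-variable form of pseudolikelihood.

```python
import math

# Hypothetical chain y0 - y1 - y2 with one pairwise table shared by both edges.
unary = [[1.0, 2.0], [1.5, 0.5], [1.0, 3.0]]   # unary[i][y_i]
pair = [[2.0, 0.5], [0.5, 2.0]]                # pair[y_i][y_j]
edges = [(0, 1), (1, 2)]
observed = [1, 1, 0]

def piecewise_log_likelihood(y):
    """Each factor is normalized on its own, ignoring every other factor."""
    total = 0.0
    for i, table in enumerate(unary):
        total += math.log(table[y[i]] / sum(table))
    for i, j in edges:
        total += math.log(pair[y[i]][y[j]] / sum(map(sum, pair)))
    return total

def log_pseudolikelihood(y):
    """Each variable is normalized given the observed values of its neighbors."""
    total = 0.0
    for i in range(len(unary)):
        def local_score(value):
            # Product of all factors touching variable i, neighbors held at y.
            score = unary[i][value]
            for a, b in edges:
                if a == i:
                    score *= pair[value][y[b]]
                elif b == i:
                    score *= pair[y[a]][value]
            return score
        total += math.log(local_score(y[i]) / sum(local_score(v) for v in range(2)))
    return total

pw = piecewise_log_likelihood(observed)
pl = log_pseudolikelihood(observed)
print(pw, pl)
```

The pseudolikelihood term for a variable changes when a neighbor's observed label changes, whereas each piecewise term depends only on the labels inside its own factor.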

Also, in statistics there has been work on general families of surrogate likelihoods,
called composite likelihoods, which are sums of marginal or conditional log likelihoods
[63]. Such composite likelihoods are consistent and asymptotically normal under
relatively general assumptions. An example of using a composite likelihood on structured
models for natural-language data is Kakade et al. [51]. But these are designed for a
different situation than ours, namely when the joint likelihoods are difficult to
compute but marginal likelihoods are easier to work with. An example of this situation is
the multivariate Gaussian. In our context, marginal likelihoods are difficult to
compute, so composite likelihoods are not as useful. Piecewise estimation is not a type of
composite likelihood, because in the likelihood of each piece, the contribution of the
rest of the model is ignored, not marginalized out.

Independently, Choi et al. [18] present a node-splitting technique for upper bounds
during inference, which is closely related to the technique used in this chapter for
learning.

4.7  Conclusion 

This  chapter  has  presented  piecewise  training,  an  intuitively  appealing  procedure 
that  separately  trains  factor  subsets,  called  pieces,  of  a  loopy  graph.  We  show  that 
this  procedure  can  be  justified  as  maximizing  a  loose  bound  on  the  log  likelihood.  On 
three  real-world  language  tasks  with  different  model  structures,  piecewise  training 
outperforms  several  versions  of  pseudolikelihood,  a  traditional  local  training  method. 
On  two  of  the  data  sets,  in  fact,  piecewise  training  is  more  accurate  than  global 
training  using  belief  propagation. 

Many  properties  of  piecewise  training  remain  to  be  explored.  Our  results  indicate 
that  in  some  situations  piecewise  training  should  replace  pseudolikelihood  as  the  local 
training  method  of  choice.  In  particular,  the  experiments  here  all  used  conditional 
training, which makes local training easier because of the large amount of information
in the conditioning variables. In the data sets here, the local features, such as
the  word  identity,  provide  almost.  In  generative  training,  there  may  be  much  less 
local  information,  making  piecewise  training  much  less  effective.  On  the  other  hand, 
from  the  exponential  family  perspective,  piecewise  training  does  still  match  expected 
statistics  of  a  subgraph  to  the  empirical  distribution,  which  still  seems  intuitively 
appealing.  For  this  reason,  it  is  hard  to  give  a  definitive  characterization  of  when 
piecewise  training  is  expected  to  work  well  or  poorly. 

A  possible  explanation  for  the  performance  of  piecewise  training  is  that  it  acts  as 
a  form  of  additional  regularization,  in  that  the  objective  function  disfavors  parameter 
settings  that  obtain  good  joint  likelihood  by  using  long-distance  effects  of  weights. 
For this reason, connections to generalization bounds for feature selection, some of
which take into account the amount of computation, may be interesting.

CHAPTER 5


PIECEWISE  PSEUDOLIKELIHOOD 


Piecewise  training  can  be  an  effective  training  method  when  the  model  structure 
is  intractable.  If  the  variables  have  large  cardinality,  however,  training  can  be  com¬ 
putationally  demanding  even  when  the  model  structure  is  tractable.  For  example, 
consider a series of processing steps of a natural-language sentence [33, 125], which
might  begin  with  part-of-speech  tagging,  continue  with  more  detailed  syntactic  pro¬ 
cessing,  and  finish  with  some  kind  of  semantic  analysis,  such  as  relation  extraction  or 
semantic  entailment.  This  series  of  steps  might  be  modeled  as  a  simple  linear  chain, 
but  each  variable  has  an  enormous  number  of  outcomes,  such  as  the  number  of  parses 
of a sentence. In such cases, even training using forward-backward is infeasible,
because it is quadratic in the variable cardinality. Thus, we desire approximate training
algorithms that are not only subexponential in the model’s treewidth but also
scale well in the variable cardinality.

Pseudolikelihood  (PL)  [8]  is  a  classical  training  method  that  addresses  both  of 
these  issues,  both  because  it  requires  no  propagation  and  also  because  its  running  time 
is  linear  in  the  variable  cardinality.  Although  in  some  situations  pseudolikelihood  can 
be  very  effective  [90,  134],  in  other  applications,  its  accuracy  can  be  poor. 

An  alternative  that  has  been  employed  occasionally  throughout  the  literature  is 
to  divide  the  factors  in  the  model  into  a  set  of  pieces,  and  train  each  piece  separately, 
in  its  own  graphical  model.  In  Chapter  4,  I  presented  this  piecewise  estimation 
method,  finding  that  it  performs  well  when  the  local  features  are  highly  informative, 
as  can  be  true  in  a  lexicalized  NLP  model  with  thousands  of  features.  As  we  saw, 
piecewise performs better than pseudolikelihood on certain types of data, sometimes
by a very large amount. So piecewise training can have good accuracy; however,
unlike pseudolikelihood, it does not scale well in the variable cardinality.

In  this  chapter,  I  introduce  and  analyze  a  hybrid  method,  called  piecewise  pseu¬ 
dolikelihood  (PWPL),  that  combines  the  advantages  of  both  approaches.  Essentially, 
while  pseudolikelihood  conditions  each  variable  on  all  of  its  neighbors,  PWPL  condi¬ 
tions  only  on  those  neighbors  within  the  same  piece  of  the  model,  for  example,  that 
share  the  same  factor.  This  is  illustrated  in  Figure  5.2.  Remarkably,  although  PWPL 
has the same computational complexity as pseudolikelihood, on real-world NLP data,
its  accuracy  is  significantly  better.  In  other  words,  in  testing  accuracy  PWPL  be¬ 
haves  more  like  piecewise  than  like  pseudolikelihood.  The  training  speed-up  of  PWPL 
can  be  significant  even  in  linear-chain  CRFs,  because  forward-backward  training  is 
quadratic  in  the  variable  cardinality. 

The  chapter  proceeds  as  follows.  In  Section  5.2.1,  I  describe  PWPL  in  terms  of 
the  node-split  graph,  which  was  presented  previously  in  Section  4.1.1.  This  viewpoint 
allows  us  to  show  that  under  certain  conditions,  PWPL  converges  to  the  piecewise 
solution  in  the  asymptotic  limit  of  infinite  data  (Section  5.2.2).  In  addition,  it  provides 
some  insight  into  when  PWPL  may  be  expected  to  do  well  and  to  do  poorly,  an  insight 
that  we  verify  on  synthetic  data  (Section  5.3.1).  Finally,  I  evaluate  PWPL  on  several 
real-world NLP data sets (Section 5.3.2), finding that it often performs comparably
to piecewise training and to maximum likelihood, and on all of our data sets PWPL
has higher accuracy than pseudolikelihood. Furthermore, PWPL can be as much as
ten  times  faster  than  batch  CRF  training. 

Figure 5.1. Example of node splitting. Left is the original model, right is the version
trained  by  piecewise.  In  this  example,  there  are  no  unary  factors. 




Figure  5.2.  Illustration  of  the  difference  between  piecewise  pseudolikelihood 
(PWPL)  and  standard  pseudolikelihood.  In  standard  PL,  at  left,  the  local  term 
for  a  variable  ys  is  conditioned  on  its  entire  Markov  blanket.  In  PWPL,  at  right,  each 
local  term  conditions  only  on  the  neighbors  within  a  single  factor. 


5.1  Piecewise  Training 

5.1.1  Background 

In  this  section,  I  make  a  few  observations  about  pseudolikelihood  and  piecewise 
training that will be useful. As in the previous chapters, we are interested in estimating
the parameters of a conditional random field p(y|x) of the form

    p(y|x) = \frac{1}{Z(x)} \prod_{a=1}^{A} \Psi_a(y_a, x_a),    (5.1)

where the factors have the exponential form

    \Psi_a(y_a, x_a) = \exp\Big\{ \sum_{k=1}^{K} \theta_{ak} f_{ak}(y_a, x_a) \Big\}.    (5.2)

Recall from Chapter 4 that pseudolikelihood is a classical approximation that simultaneously
classifies each node given its neighbors in the graph. For a variable s, let N(s)
be  the  set  of  all  of  its  neighbors,  not  including  s  itself.  Then  the  pseudolikelihood  is 
defined  as 


    \ell_{PL}(\theta) = \sum_s \log p(y_s | y_{N(s)}, x),

where the conditional distributions are

    p(y_s | y_{N(s)}, x) = \frac{\prod_{a \ni s} \Psi_a(y_s, y_{N(s)}, x_a)}{\sum_{y'_s} \prod_{a \ni s} \Psi_a(y'_s, y_{N(s)}, x_a)},    (5.3)

where a \ni s means the set of all factors a that depend on the variable s. In other
words,  this  is  a  sum  of  conditional  log  likelihoods,  where  for  each  variable  we  condition 
on  the  true  values  of  its  neighbors  in  the  training  data. 
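
As a concrete illustration of (5.3), the following Python sketch evaluates the pseudolikelihood of one assignment in a small model. It is only a sketch under stated assumptions: the observations x are folded into the factor scores, every variable has the same cardinality, and the names log_pseudolikelihood, agree, and chain are hypothetical rather than taken from the experiments.

```python
import math


def log_pseudolikelihood(y, factors, cardinality):
    """Sum over variables s of log p(y_s | y_N(s)), following (5.3).

    y           -- list holding the observed value of each variable
    factors     -- list of (scope, log_psi) pairs; scope is a tuple of
                   variable indices and log_psi maps a tuple of values
                   to a log-score (illustrative representation)
    cardinality -- number of values each variable can take
    """
    total = 0.0
    for s in range(len(y)):
        # Factors that do not touch s cancel in the ratio of (5.3).
        touching = [(scope, f) for scope, f in factors if s in scope]

        def score(value):
            y_mod = list(y)
            y_mod[s] = value  # neighbors stay clamped to their observed values
            return sum(f(tuple(y_mod[v] for v in scope)) for scope, f in touching)

        # The denominator sums over the values of the single variable s.
        log_z = math.log(sum(math.exp(score(v)) for v in range(cardinality)))
        total += score(y[s]) - log_z
    return total


def agree(values):
    """Hypothetical pairwise log-factor that rewards equal labels."""
    return 1.0 if values[0] == values[1] else 0.0


# Hypothetical three-variable chain with two pairwise factors.
chain = [((0, 1), agree), ((1, 2), agree)]
pl = log_pseudolikelihood([0, 0, 1], chain, cardinality=2)
```

Each local term touches only the factors adjacent to one variable, so no messages are propagated through the graph.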

It  is  a  well-known  result  that  if  the  model  family  includes  the  true  distribution, 
then  pseudolikelihood  converges  to  the  true  parameter  setting  in  the  limit  of  infinite 
data  [41,  48].  One  way  to  see  this  is  that  pseudolikelihood  is  attempting  to  match 
all of the model’s conditional distributions to the data. If it succeeds in matching them
all  exactly,  then  a  Gibbs  sampler  run  on  the  model  distribution  will  have  the  same 
invariant  distribution  as  a  Gibbs  sampler  run  on  the  true  data  distribution. 

In  the  previous  chapter,  I  also  presented  piecewise  training,  based  on  the  intuition 
that if each factor \Psi_a(y_a, x_a) can on its own accurately predict y_a from x_a, then
the  prediction  of  the  global  factor  graph  will  also  be  accurate.  Formally,  piecewise 
training  maximizes  the  objective  function 


    \ell_{PW}(\theta) = \sum_a \log \frac{\Psi_a(y_a, x_a)}{\sum_{y'_a} \Psi_a(y'_a, x_a)}.    (5.4)


The  explanation  for  the  name  piecewise  is  that  each  term  in  (5.4)  corresponds  to  a 
“piece”  of  the  graph,  in  this  case  a  single  factor,  and  that  term  would  be  the  exact 
likelihood of the piece if the rest of the graph were omitted. An important observation
is  that  the  denominator  of  (5.3)  sums  over  assignments  to  a  single  variable,  whereas 
the  denominator  of  (5.4)  sums  over  assignments  to  an  entire  factor,  which  may  be 
a  much  larger  set.  This  is  why  pseudolikelihood  can  be  much  more  computationally 
efficient  than  piecewise  when  the  variable  cardinality  is  large. 
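
The contrast between the two denominators can be seen in a short Python sketch of the piecewise objective (5.4). As before, this is an illustration under assumptions: the observations are folded into the factor scores, all variables share one cardinality, and log_piecewise, agree, and chain are hypothetical names.

```python
import itertools
import math


def log_piecewise(y, factors, cardinality):
    """Sum over factors a of log [Psi_a(y_a) / sum_{y'_a} Psi_a(y'_a)], as in (5.4)."""
    total = 0.0
    for scope, log_psi in factors:
        observed = tuple(y[v] for v in scope)
        # The denominator ranges over every joint assignment to the factor,
        # that is, cardinality ** len(scope) terms.
        assignments = itertools.product(range(cardinality), repeat=len(scope))
        log_z = math.log(sum(math.exp(log_psi(a)) for a in assignments))
        total += log_psi(observed) - log_z
    return total


def agree(values):
    """Hypothetical pairwise log-factor that rewards equal labels."""
    return 1.0 if values[0] == values[1] else 0.0


chain = [((0, 1), agree), ((1, 2), agree)]
pw = log_piecewise([0, 0, 1], chain, cardinality=2)
```

With binary variables the difference is negligible, but the inner sum grows as the cardinality raised to the factor size, whereas the pseudolikelihood sum grows only linearly in the cardinality.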

5.2  Piecewise  Pseudolikelihood 

In  this  section,  I  define  piecewise  pseudolikelihood  (Section  5.2.1),  and  describe  its 
asymptotic  behavior  using  well-known  results  about  pseudolikelihood  (Section  5.2.2). 

5.2.1  Definition 

The  main  motivation  of  piecewise  training  is  computational  efficiency,  but  in  fact 
piecewise  does  not  always  provide  a  large  gain  in  training  time  over  other  approximate 
methods.  In  particular,  the  time  required  to  evaluate  the  piecewise  likelihood  at  one 
parameter  setting  is  the  same  as  is  required  to  run  one  iteration  of  belief  propagation 
(BP). More precisely, piecewise training uses O(m^K) time, where m is the maximum
number of assignments to a single variable y_s and K is the size of the largest factor.
Belief propagation also uses O(m^K) time per iteration; thus, the only computational
savings  over  BP  is  a  factor  of  the  number  of  BP  iterations  required.  In  tree-structured 
graphs,  piecewise  training  is  no  more  efficient  than  forward-backward. 

To  address  this  problem,  we  propose  piecewise  pseudolikelihood.  Piecewise  pseu¬ 
dolikelihood  (PWPL)  is  defined  as: 


    \ell_{PWPL}(\theta; x, y) = \sum_a \sum_{s \in a} \log p_{LCL}(y_s | y_{a \setminus s}, x, \theta_a),    (5.5)

where (x, y) is an observed data point, the index a ranges over all factors in the
model, the set a \setminus s means all of the variables in the domain of factor a except for s,
and p_{LCL} is a locally-normalized score similar to a conditional probability and defined
below.

In  other  words,  the  piecewise  pseudolikelihood  is  a  sum  of  local  conditional  log- 
probabilities.  Each  variable  s  participates  as  the  domain  of  a  conditional  once  for  each 
factor  that  it  neighbors.  As  in  piecewise  training,  the  local  conditional  probabilities 
p_{LCL} are not the true probabilities according to the model, but are a quantity computed
locally from a single piece (in this case, a single factor). The local probabilities p_{LCL}
are defined as

    p_{LCL}(y_s | y_{a \setminus s}, x, \theta_a) = \frac{\Psi_a(y_s, y_{a \setminus s}, x_a)}{\sum_{y'_s} \Psi_a(y'_s, y_{a \setminus s}, x_a)}.    (5.6)


Then given a data set D = \{(x^{(i)}, y^{(i)})\}, we select the parameter setting that maximizes

    O_{PWPL}(\theta; D) = \sum_i \ell_{PWPL}(\theta; x^{(i)}, y^{(i)}) - \sum_a \frac{\|\theta_a\|^2}{2\sigma^2},    (5.7)

where  the  second  term  is  a  Gaussian  prior  on  the  parameters  to  reduce  overfitting. 
The piecewise pseudolikelihood is concave as a function of \theta (its negation is convex), and so
its maximum can be found by standard techniques. In the experiments below, we use
limited-memory BFGS [89].
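
To make the definition concrete, the following Python sketch evaluates the PWPL of (5.5) and (5.6) together with a quadratic penalty. It makes several simplifying assumptions that are not part of the experiments: each factor has one free weight per joint assignment (so theta[a] is a lookup table), the observations x are dropped, and the Gaussian prior is taken to be the usual sum of squared weights divided by 2 * sigma2. The names log_pwpl and pwpl_objective are hypothetical.

```python
import itertools
import math


def log_pwpl(y, scopes, theta, m):
    """PWPL of one instance: sum over factors a and variables s in a of log p_LCL."""
    total = 0.0
    for scope, table in zip(scopes, theta):
        observed = tuple(y[v] for v in scope)
        for pos in range(len(scope)):
            # Vary only the variable at `pos`; the rest of the factor stays observed.
            alternatives = [observed[:pos] + (v,) + observed[pos + 1:] for v in range(m)]
            log_z = math.log(sum(math.exp(table[alt]) for alt in alternatives))
            total += table[observed] - log_z
    return total


def pwpl_objective(data, scopes, theta, m, sigma2):
    """Penalized objective: data fit minus an assumed quadratic (Gaussian) penalty."""
    fit = sum(log_pwpl(y, scopes, theta, m) for y in data)
    penalty = sum(w * w for table in theta for w in table.values()) / (2.0 * sigma2)
    return fit - penalty


m = 2
scopes = [(0, 1), (1, 2)]  # hypothetical three-variable chain
pairs = list(itertools.product(range(m), repeat=2))
theta_zero = [{a: 0.0 for a in pairs} for _ in scopes]
theta_agree = [{a: (1.0 if a[0] == a[1] else 0.0) for a in pairs} for _ in scopes]
data = [[0, 0, 0], [1, 1, 1]]

obj_zero = pwpl_objective(data, scopes, theta_zero, m, sigma2=10.0)
obj_agree = pwpl_objective(data, scopes, theta_agree, m, sigma2=10.0)
```

On this toy data the weights that reward agreement score higher than the all-zero weights, even after the penalty is subtracted. Each inner sum has only m terms, which is the source of the method's efficiency.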


For  simplicity,  we  have  presented  PWPL  for  the  case  in  which  each  piece  contains 
exactly  one  factor.  If  larger  pieces  are  desired,  then  simply  take  the  summation  over 
a  in  (5.5)  to  be  over  pieces  rather  than  over  factors,  and  generalize  the  definition  of 
p_{LCL} appropriately.

Compared  to  standard  piecewise,  the  main  advantage  of  PWPL  is  that  training 
requires only O(m) time rather than O(m^K). Compared to pseudolikelihood, the
difference is that whereas in pseudolikelihood each local term conditions on the entire
Markov  blanket,  in  PWPL  each  local  term  conditions  only  on  a  variable’s  neighbors 
within  a  single  factor.  For  this  reason,  the  local  terms  in  PWPL  are  not  true  con¬ 
ditional  distributions  according  to  the  model.  The  difference  between  PWPL  and 
pseudolikelihood is illustrated in Figure 5.2. In the next section, we discuss why in
some situations this can cause PWPL to have better accuracy than pseudolikelihood.
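
The gap between the two running times can be made tangible by counting the terms in the normalizing sums for a single factor. The sketch below is only a back-of-the-envelope count; the helper name terms_per_factor is hypothetical, and the example values (45 labels, pairwise factors) are chosen to resemble a part-of-speech chain rather than measured from any experiment.

```python
def terms_per_factor(m, K):
    """Count the terms summed when normalizing one factor of K variables.

    m -- number of values per variable
    K -- number of variables in the factor
    """
    piecewise = m ** K  # one joint sum over all assignments to the factor
    pwpl = K * m        # K local conditionals, each summing over m values
    return piecewise, pwpl


piecewise_terms, pwpl_terms = terms_per_factor(m=45, K=2)
```

For m = 45 and K = 2 this gives 2025 terms for piecewise against 90 for PWPL, and the ratio widens as either m or K grows.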

5.2.2  Analysis 

PWPL  can  be  readily  understood  from  the  node-split  viewpoint.  In  particular, 
the piecewise pseudolikelihood is simply the standard pseudolikelihood applied to
the node-split graph. In this section, we use the asymptotic consistency of standard
pseudolikelihood to gain insight into the performance of PWPL.

Let p^*(y) be the true distribution of the data, after the node splitting transformation
has been applied. Neither PWPL nor standard piecewise can distinguish this
distribution from the distribution p_{NS} on the node-split graph that is defined by the
product of marginals

    p_{NS}(y) = \prod_{a \in G'} p^*(y_a),    (5.8)

where G' is the node-split graph, and p^*(y_a) is the marginal distribution of the variables
in factor a according to the true distribution. By that we mean that the piecewise
likelihood of any parameter setting \theta when the data distribution is exactly the
true distribution p^* is equal to the piecewise likelihood of \theta when the data distribution
equals the distribution p_{NS}, and similarly for PWPL.

So equivalently, we suppose that we are given an infinite data set drawn from
the distribution p_{NS}. Now, the standard consistency result for pseudolikelihood is
that if the model class contains the generating distribution, then the pseudolikelihood
estimate converges asymptotically to the true distribution. In this setting, that
implies the following statement. If the model family defined by G' contains p_{NS}, then
piecewise pseudolikelihood converges in the limit to the same parameter setting as
standard piecewise.
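
This statement can be checked numerically in the simplest case. The sketch below assumes a single pairwise piece with one free weight per joint assignment, so that the piece can represent any marginal p^*(y_a); the piecewise optimum is then \theta = \log p^* up to an additive constant. The distribution p_star and the function expected_pwpl are hypothetical, and the check only probes random perturbations rather than proving optimality.

```python
import math
import random


def expected_pwpl(p_star, theta, m):
    """Expected PWPL of one pairwise piece when data are drawn from p_star."""
    total = 0.0
    for (ys, yt), prob in p_star.items():
        # Normalizer of p_LCL(y_s | y_t): sum over the first variable.
        over_s = math.log(sum(math.exp(theta[(v, yt)]) for v in range(m)))
        # Normalizer of p_LCL(y_t | y_s): sum over the second variable.
        over_t = math.log(sum(math.exp(theta[(ys, v)]) for v in range(m)))
        total += prob * (2.0 * theta[(ys, yt)] - over_s - over_t)
    return total


m = 2
p_star = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Piecewise solution for a fully parameterized piece: theta = log p_star.
theta_pw = {k: math.log(p) for k, p in p_star.items()}
best = expected_pwpl(p_star, theta_pw, m)

rng = random.Random(0)
perturbed = []
for _ in range(100):
    theta = {k: w + rng.uniform(-1.0, 1.0) for k, w in theta_pw.items()}
    perturbed.append(expected_pwpl(p_star, theta, m))

never_better = all(value <= best + 1e-12 for value in perturbed)
```

At \theta = \log p^* the local conditionals coincide with the true conditionals of the piece, so no perturbation improves the expected PWPL. When the piece cannot represent the marginal, this agreement between the two estimators is no longer guaranteed.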

Because  this  is  an  asymptotic  statement,  it  provides  no  guarantee  about  how 
PWPL  will  perform  on  real  data.  Even  so,  it  has  several  interesting  consequences  that 
provide insight into the method. First, it may impact what sort of model is conducive
to PWPL. For example, consider a Potts model with unary factors \Psi(y_s) = [1 \; e^{\theta_s}]^\top
for each variable s, and pairwise factors


    \Psi(y_s, y_t) = [entries illegible in the source]    (5.9)


for each edge (s, t), so that the model parameters are \{\theta_s\} \cup \{\theta_{st}\}. Then the above
condition  for  PWPL  to  converge  in  the  infinite  data  limit  will  never  be  satisfied, 
because  the  pairwise  piece  cannot  represent  the  marginal  distribution  of  its  variables. 
In  this  case,  PWPL  may  be  a  bad  choice,  or  it  may  be  useful  to  consider  pieces  that 
contain  more  than  one  factor.  In  particular,  shared-unary  piecewise  (see  Section  4.5.1) 
may  be  appropriate. 

Second,  this  analysis  provides  intuition  about  the  differences  between  piecewise 
pseudolikelihood and standard pseudolikelihood. For each variable s with neighborhood
N(s), standard pseudolikelihood approximates the model marginal p(y_{N(s)}) over
the neighborhood by the empirical marginal \tilde{p}(y_{N(s)}). We expect this approximation
to work well when the model is a good fit, and the data is ample.

In  PWPL,  we  perform  the  node-splitting  transformation  on  the  graph  prior  to 
maximizing  the  pseudolikelihood.  The  effect  of  this  is  to  reduce  each  variable’s  neigh¬ 
borhood  size,  that  is,  the  cardinality  of  N(s). 

This  has  two  potential  advantages.  First,  because  the  neighborhood  size  is  small, 
PWPL  may  converge  to  piecewise  faster  than  pseudolikelihood  converges  to  the  exact 
solution.  Of  course,  the  exact  solution  should  be  better  than  piecewise,  so  whether 
to  prefer  standard  PL  or  piecewise  PL  depends  on  precisely  how  much  faster  the  con¬ 
vergence  is.  Second,  the  node-split  model  may  be  able  to  exactly  model  the  marginal 
of  its  neighborhood  in  cases  where  the  original  graph  may  not  be  able  to  model  its 
larger neighborhood. Because the neighborhood is smaller, the pseudolikelihood convergence
condition may hold in the node-split model when it does not in the original
model. In other words, standard pseudolikelihood requires that the original model is
a good fit to the full distribution. In contrast, we expect piecewise pseudolikelihood
to be a good approximation to piecewise when each individual piece fits the empirical
distribution well. The performance of piecewise pseudolikelihood need not require
the node-split model to represent the distribution across pieces.

Figure 5.3. Comparison of piecewise to pseudolikelihood on synthetic data. Pseudolikelihood
has slightly better accuracy on training instances than piecewise. (Piecewise
and PWPL perform exactly the same; this is not shown.) [Plot not reproduced; one
axis is labeled "Pseudolikelihood accuracy".]

Finally,  this  analysis  suggests  that  we  might  expect  piecewise  pseudolikelihood 
to  perform  poorly  in  two  regimes:  First,  if  so  much  data  is  available  that  pseudo- 
likelihood  has  asymptotically  converged,  then  it  makes  sense  to  use  pseudolikelihood 
rather  than  piecewise  pseudolikelihood.  Second,  if  features  of  the  local  factors  cannot 
fit  the  training  data  well,  then  we  expect  the  node-split  model  to  fit  the  data  quite 
poorly,  and  piecewise  pseudolikelihood  cannot  possibly  do  well. 

5.3  Experiments 

5.3.1  Synthetic  Data 

Figure 5.4. Learning curves for PWPL and pseudolikelihood. For smaller amounts
of training data PWPL performs better than pseudolikelihood, but for larger data
sets, the situation is reversed.

In the previous section, we argued intuitively that PWPL may perform better on
small data sets, and pseudolikelihood on larger ones. In this section we verify this
intuition in experiments on synthetic data. The general setup is replicated from Lafferty
et al. We generate data from a second-order HMM with transition probabilities

    p_\alpha(y_t | y_{t-1}, y_{t-2}) = \alpha p_2(y_t | y_{t-1}, y_{t-2}) + (1 - \alpha) p_1(y_t | y_{t-1})    (5.10)

and emission probabilities

    p_\alpha(x_t | y_t, x_{t-1}) = \alpha p_2(x_t | y_t, x_{t-1}) + (1 - \alpha) p_1(x_t | y_t).    (5.11)

Thus, for \alpha = 0, the generating distribution p_\alpha is a first-order HMM, and for \alpha = 1, it
is an autoregressive second-order HMM. We compare different approximate methods
for training a first-order CRF. Therefore higher values of \alpha make the learning problem
more difficult, because the model family does not contain second-order dependencies.
We use five states and 26 possible observation values. For each setting of \alpha, we
sample 25 different generating distributions. From each generating distribution we
sample 1,000 training instances of length 25, and 1,000 testing instances. We use
\alpha \in \{0, 0.1, 0.25, 0.5, 0.75, 1.0\}, for 150 synthetic generating models in all.
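
The generating process of (5.10) and (5.11) can be sketched in Python as follows. Several details are assumptions made only for illustration, since they are not specified above: the component tables p_1 and p_2 are drawn by normalizing uniform random weights, the history at the start of a sequence is padded with state 0 and observation 0, and the names random_distribution and make_generator are hypothetical.

```python
import random


def random_distribution(rng, size):
    """Draw a random probability vector (assumed scheme: normalized uniform weights)."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def make_generator(rng, alpha, n_states=5, n_obs=26):
    """Build a sampler for the mixture of first- and second-order models."""
    states, symbols = range(n_states), range(n_obs)
    p1_trans = {a: random_distribution(rng, n_states) for a in states}
    p2_trans = {(a, b): random_distribution(rng, n_states) for a in states for b in states}
    p1_emit = {y: random_distribution(rng, n_obs) for y in states}
    p2_emit = {(y, x): random_distribution(rng, n_obs) for y in states for x in symbols}

    def mix(second, first):
        # Convex combination used in both (5.10) and (5.11).
        return [alpha * s + (1.0 - alpha) * f for s, f in zip(second, first)]

    def sample(length):
        ys, xs = [], []
        y_prev, y_prev2, x_prev = 0, 0, 0  # assumed padding at the sequence start
        for _ in range(length):
            trans = mix(p2_trans[(y_prev, y_prev2)], p1_trans[y_prev])
            y = rng.choices(states, weights=trans)[0]
            emit = mix(p2_emit[(y, x_prev)], p1_emit[y])
            x = rng.choices(symbols, weights=emit)[0]
            ys.append(y)
            xs.append(x)
            y_prev2, y_prev, x_prev = y_prev, y, x
        return ys, xs

    return sample


sample = make_generator(random.Random(0), alpha=0.5)
ys, xs = sample(25)
```

Setting alpha to 0 removes every second-order dependency, and setting it to 1 removes the first-order components, which matches the two extremes described above.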

                              ML       PL       PW     PWPL
POS            Accuracy     94.4     94.4     94.2     94.4
               Time (s)    33846     6705    23537     3911
Chunking       Chunk F1     91.4     90.3     91.7     91.4
               Time (s)    24288     1534     5708      766
Named-entity   Chunk F1     90.5     85.1     90.5     90.3
               Time (s)    52396     8651     6311     4780

Table 5.1. Comparison of piecewise pseudolikelihood to standard piecewise and to
pseudolikelihood on real-world NLP tasks. Piecewise pseudolikelihood is in all cases
comparable to piecewise, and on two of the data sets superior to pseudolikelihood.


First,  we  find  that  piecewise  pseudolikelihood  performs  almost  identically  to  stan¬ 
dard  piecewise  training.  Averaged  over  the  150  data  sets,  the  mean  difference  in 
testing  error  between  piecewise  pseudolikelihood  and  piecewise  is  0.002,  and  the  cor¬ 
relation  is  0.999. 

                BP       PL       PW     PWPL
Start-Time    96.5     82.2     97.1     94.1
End-Time      95.9     73.4     96.5     90.4
Location      85.8     73.0     88.1     85.3
Speaker       74.5     27.9     72.7     65.0

Table 5.2. F1 performance of PWPL, piecewise, and pseudolikelihood on information
extraction from seminar announcements. Both standard piecewise and piecewise
pseudolikelihood outperform pseudolikelihood.

Second, we compare piecewise to traditional pseudolikelihood. On this data, pseudolikelihood
performs slightly better overall, but the difference is not statistically
significant (paired t-test; p > 0.1). However, when we examine the accuracy as a
function  of  training  set  size  (Figure  5.4),  we  notice  an  interesting  two-regime  behav¬ 
ior.  Both  PWPL  and  pseudolikelihood  seem  to  be  converging  to  a  limit,  and  the 
eventual  pseudolikelihood  limit  is  higher  than  PWPL,  but  PWPL  converges  to  its 
limit  faster.  This  is  exactly  the  behavior  intuitively  predicted  by  the  argument  in 
Section  5.2.2:  that  PWPL  can  converge  to  the  piecewise  solution  in  less  training  data 
than  pseudolikelihood  to  its  (potentially  better)  solution. 

Of  course,  the  training  set  sizes  considered  in  Figure  5.4  are  fairly  small,  but  this 
is  exactly  the  case  we  are  interested  in,  because  on  natural  language  tasks,  even  when 
hundreds  of  thousands  of  words  of  labeled  data  are  available,  this  is  still  a  small 
amount  of  data  compared  to  the  number  of  useful  features. 

5.3.2  Real-World  Data 

Now,  we  evaluate  piecewise  pseudolikelihood  on  four  real-world  NLP  tasks:  part- 
of-speech  tagging,  named-entity  recognition,  noun-phrase  chunking,  and  information 
extraction. 

For part-of-speech tagging (POS), we report results on the WSJ Penn Treebank
data  set.  Results  are  averaged  over  five  different  random  subsets  of  1911  sentences, 
sampled  from  Sections  0-18  of  the  Treebank.  Results  are  reported  from  the  standard 
development  set  of  Sections  19-21  of  the  Treebank.  We  use  a  first-order  linear  chain 
CRF.  There  are  45  part-of-speech  labels. 

For  the  task  of  noun-phrase  chunking  (chunking),  we  use  a  loopy  model,  the  fac¬ 
torial  CRF  introduced  in  Section  3.1.  As  in  that  section,  we  consider  here  the  task 
of  jointly  predicting  part-of-speech  tags  and  segmenting  noun  phrases  in  newswire 
text. Thus, the FCRF we use has a two-level grid structure. We report results here
on  subsets  of  223  training  sentences,  and  the  standard  test  set  of  2012  sentences. 
Results  are  averaged  over  5  different  random  subsets.  There  are  45  different  POS 
labels, and the three NP labels. We use the same features and experimental setup as
previous  work  [119].  We  report  joint  accuracy  on  (NP,  POS)  pairs;  other  evaluation 
metrics  show  similar  trends. 

In named-entity recognition, the task is to find proper nouns in text. We use
the  CoNLL  2003  data  set,  consisting  of  14,987  newswire  sentences  annotated  with 
names  of  people,  organizations,  locations,  and  miscellaneous  entities.  We  test  on  the 
standard  development  set  of  3,466  sentences.  Evaluation  is  done  using  precision  and 
recall on the extracted chunks, and we report F_1 = 2PR/(P + R). We use a linear-chain
CRF,  whose  features  are  described  in  Table  4.4. 
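
For reference, the F_1 measure can be computed from chunk counts as in the short Python sketch below. The function name f1_score and the example counts are hypothetical, and returning zero when a denominator vanishes is a convention assumed here, not one stated above.

```python
def f1_score(n_correct, n_predicted, n_gold):
    """Harmonic mean of precision and recall, F1 = 2PR / (P + R).

    n_correct   -- predicted chunks that exactly match a gold chunk
    n_predicted -- total chunks the model extracted
    n_gold      -- total chunks in the reference annotation
    """
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0  # assumed convention when nothing is correct
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical counts: 8 of 10 predicted chunks are correct, out of 16 gold chunks.
example_f1 = f1_score(n_correct=8, n_predicted=10, n_gold=16)
```

In this example precision is 0.8 and recall is 0.5, giving an F_1 of 8/13, which lies closer to the smaller of the two values.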

Finally, for the task of information extraction, we consider a model with many
irregular loops, which is the skip chain model introduced in Section 3.2. As in that
section,  the  task  is  to  extract  information  about  seminars  from  email  announcements 
from a standard data set [35]. We use the same features and test/training split as
the previous work. The data is labeled with four fields (Start-Time, End-Time,
Location, and Speaker), and we report token-level F1 on each field separately.

For  all  the  data  sets,  we  compare  to  pseudolikelihood,  piecewise  training,  and  con¬ 
ditional  maximum  likelihood  with  belief  propagation.  All  of  these  objective  functions 
are  maximized  using  limited-memory  BFGS.  We  use  a  Gaussian  prior  with  variance 
\sigma^2 = 10.

Stochastic  gradient  techniques,  such  as  stochastic  meta-descent  [110],  would  be 
likely  to  converge  faster  than  the  baselines  we  report  here,  because  all  our  current 
results  use  batch  optimization.  However,  stochastic  gradient  can  be  used  with  PWPL 
just  as  with  standard  maximum  likelihood.  Thus,  although  the  training  time  of  our 
baseline  could  likely  be  improved  considerably,  the  same  is  true  of  our  new  approach, 
so  that  our  comparison  is  fair. 




5.3.3  Results 


For  the  first  three  tasks — part-of-speech  tagging,  chunking,  and  NER — piecewise 
pseudolikelihood and standard piecewise training have equivalent accuracy both to
each  other  and  to  maximum  likelihood  (Table  5.1).  Despite  this,  piecewise  pseudo- 
likelihood  is  much  more  efficient  than  standard  piecewise  (Table  5.1).  On  the  named- 
entity  data,  which  has  the  fewest  labels,  PWPL  uses  75%  of  the  time  of  standard 
piecewise,  a  modest  improvement.  On  the  data  sets  with  more  labels,  the  difference 
is  more  dramatic:  on  the  POS  data,  PWPL  uses  16%  of  the  time  of  piecewise  and  on 
the chunking data, PWPL needs only 13%. Similarly, PWPL is also between 5 and
10 times faster than maximum likelihood.
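
The percentages quoted for PWPL relative to standard piecewise can be re-derived from the Time rows of Table 5.1. The Python sketch below does only that arithmetic; the dictionary layout and the name percent_of_piecewise are hypothetical, and the values are truncated to whole percents, which is the rounding that reproduces the figures in the text.

```python
# Training times in seconds, copied from Table 5.1.
times = {
    "POS": {"PW": 23537, "PWPL": 3911},
    "Chunking": {"PW": 5708, "PWPL": 766},
    "Named-entity": {"PW": 6311, "PWPL": 4780},
}

# Integer division truncates each ratio to a whole percent.
percent_of_piecewise = {
    task: (100 * t["PWPL"]) // t["PW"] for task, t in times.items()
}
```

The resulting values are 16 for POS, 13 for chunking, and 75 for named-entity recognition.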

The  training  times  of  the  baseline  methods  may  appear  relatively  modest.  If  so, 
this  is  because  for  both  the  chunking  and  POS  data  sets,  we  use  relatively  small 
subsets  of  the  full  training  data,  to  make  running  this  comparison  more  convenient. 
This  makes  the  absolute  difference  in  training  time  even  more  meaningful  than  it  may 
appear  at  first.  Also,  it  may  appear  from  Table  5.1  that  PWPL  is  faster  than  standard 
pseudolikelihood,  but  the  apparent  difference  is  due  to  low-level  inefficiencies  in  our 
implementation.  In  fact  the  two  algorithms  have  similar  complexity. 

On  the  skip  chain  data  (Table  5.2),  standard  piecewise  performs  worse  than  ex¬ 
act  training  using  BP,  and  piecewise  pseudolikelihood  performs  worse  than  standard 
piecewise.  Both  piecewise  methods,  however,  perform  better  than  pseudolikelihood. 

As predicted in Section 5.2.2, pseudolikelihood is indeed a better approximation
on  the  node-split  graph.  In  Table  5.1,  PL  performs  much  worse  than  ML,  but  PWPL 
performs  only  slightly  worse  than  PW.  In  Table  5.2,  the  difference  between  PWPL 
and  PW  is  larger,  but  still  less  than  the  difference  between  PL  and  ML. 

5.4 Discussion and Related Work

Piecewise training and piecewise pseudolikelihood can both be considered types of
local training methods that avoid propagation throughout the graph. Such training
methods have recently been the subject of much interest [1, 94, 134]. Of course, the
local  training  method  most  closely  connected  to  the  current  work  is  pseudolikelihood 
itself.  We  are  unaware  of  previous  variants  of  pseudolikelihood  that  condition  on  less 
than  the  full  Markov  blanket. 

An  interesting  connection  exists  between  piecewise  pseudolikelihood  and  maxi¬ 
mum  entropy  Markov  models  (MEMMs)  [72,  100].  In  a  linear  chain  with  variables 
y_1, \ldots, y_T, we can rewrite the piecewise pseudolikelihood as

    \ell_{PWPL}(\theta) = \sum_{t=1}^{T} \log p_{LCL}(y_t | y_{t-1}, x) \, p_{LCL}(y_{t-1} | y_t, x).    (5.12)

The first part of (5.12) is exactly the likelihood for an MEMM, and the second part
is the likelihood of a backward MEMM. Interestingly, MEMMs crucially depend on
normalizing the factors at both training and test time. Including local normalization
at training time but not at test time performs very poorly. But by adding the backward
terms,  in  PWPL  we  are  able  to  drop  normalization  at  test  time,  and  therefore  PWPL 
does  not  suffer  from  label  bias. 
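
The forward and backward parts of (5.12) can be separated explicitly, as in the following Python sketch. It is a simplified illustration: the observations x are omitted, a single pairwise table is shared across all positions, the sum runs over adjacent label pairs, and the names forward_term, backward_term, and chain_pwpl are hypothetical.

```python
import math


def forward_term(table, y_prev, y_cur, m):
    """log p_LCL(y_t | y_{t-1}): normalize the pairwise factor over y_t."""
    log_z = math.log(sum(math.exp(table[(y_prev, v)]) for v in range(m)))
    return table[(y_prev, y_cur)] - log_z


def backward_term(table, y_prev, y_cur, m):
    """log p_LCL(y_{t-1} | y_t): normalize the same factor over y_{t-1}."""
    log_z = math.log(sum(math.exp(table[(v, y_cur)]) for v in range(m)))
    return table[(y_prev, y_cur)] - log_z


def chain_pwpl(y, table, m):
    """Return the forward (MEMM-like) and backward sums over adjacent pairs."""
    pairs = list(zip(y, y[1:]))
    forward = sum(forward_term(table, a, b, m) for a, b in pairs)
    backward = sum(backward_term(table, a, b, m) for a, b in pairs)
    return forward, backward


m = 3
# Hypothetical log-factor values, chosen only to be non-uniform.
table = {(a, b): 0.5 * a - 0.25 * b for a in range(m) for b in range(m)}
forward, backward = chain_pwpl([0, 1, 2, 1], table, m)
pwpl_value = forward + backward
```

Both terms reuse the same unnormalized factor, so the learned weights can be applied at test time without any local normalization.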

The  current  work  also  has  an  interesting  connection  to  search-based  learning  meth¬ 
ods  [28].  Such  methods  learn  a  model  to  predict  the  next  state  of  a  local  search 
procedure  from  a  current  state.  Typically,  training  is  viewed  as  classification,  where 
the  correct  next  states  are  positive  examples,  and  alternative  next  states  are  negative 
examples. One view of the current work is that it incorporates backward training
examples that attempt to predict the previous search state given the current state.

Finally,  stochastic  gradient  methods,  which  make  gradient  steps  based  on  sub¬ 
sets  of  the  data,  have  recently  been  shown  to  converge  significantly  faster  for  CRF 
training than batch methods, which evaluate the gradient on the entire data set before
updating the parameters [136]. Stochastic gradient methods are currently the
method  of  choice  for  training  linear-chain  CRFs,  especially  when  the  data  set  is  large 
and  redundant.  However,  as  mentioned  above,  stochastic  gradient  methods  can  also 
be  applied  to  piecewise  pseudolikelihood.  Also,  in  some  cases,  such  as  in  relational 
learning  problems,  the  data  are  not  iid,  and  the  model  includes  explicit  dependencies 
between the training instances. For such a model, it is unclear how to apply stochastic
gradient, but piecewise pseudolikelihood may still be useful. Finally, stochastic
gradient  methods  do  not  address  cases  in  which  the  variables  have  large  cardinality, 
or  when  the  graphical  structure  of  a  single  training  instance  is  intractable. 

5.5  Summary 

This chapter has presented piecewise pseudolikelihood (PWPL), a local training
method that is especially attractive when the variables in the model have large
cardinality. Because PWPL conditions on fewer variables, it can have better accuracy than
standard pseudolikelihood, and is dramatically more efficient than standard piecewise,
requiring as little as 13% of the training time.

CHAPTER 6


BELIEF  PROPAGATION 


Previously  we  have  seen  that  many  local  training  methods  can  be  interpreted 
as  approximate  training  using  BP  with  early  stopping.  This  raises  the  question  of 
whether  early  stopping  after  larger  numbers  of  iterations  is  generally  useful.  But  we 
have  also  seen  that  early  stopping  interacts  poorly  with  second-order  optimization 
algorithms  (Section  3.1.6.2),  which  are  particularly  useful  when  there  are  a  large
number  of  parameters.  In  this  chapter,  I  try  to  make  early  stopping  more  useful  by 
exploring  the  schedules  used  to  prioritize  messages.  The  hope  is  that  early  stopping 
may  be  most  effective  if  we  can  get  as  much  work  as  possible  out  of  the  messages  that 
we  have  time  to  send. 

Many popular approximate inference methods, such as belief propagation, its
generalizations EP [77] and GBP [148], and structured mean-field methods [50], consist
of a set of equations which are iterated to find a fixed point. The fixed-point updates
are not usually guaranteed to converge. The schedule for propagating the updates can
make a crucial difference both to how long the updates take to converge, and even
whether they converge at all. Recently, dynamic schedules (in which the message
values during inference are used to determine which update to perform next) have
been shown to converge much faster on hard networks than static schedules [30]. In
this chapter, I explore dynamic schedules both for inference and for learning.

First, I propose a new dynamic schedule for the inference problem, which I call
residual BP with lookahead zero (RBP0L) (Section 6.1). The idea behind the new schedule
is to compute each message's priority cheaply, as the sum of how much its antecedents
have changed. It can be shown, using arguments of Ihler et al. [49], that this priority
is an upper bound on how much the message would change if the update were to
be computed. On a natural-language data set, our schedule is more computationally
efficient than the schedule presented previously by Elidan et al. [30].

Second,  I  present  a  first  step  toward  the  goal  of  using  dynamic  schedules  to  perform 
inference  and  learning  in  the  same  system  of  equations  (Section  6.2).  Combining 
both  problems  into  a  single  system  potentially  allows  for  very  flexible  scheduling. 
The  idea  is  that  certain  areas  of  parameter  space  should  be  easier  for  approximate 
inference  than  others,  so  we  should  run  BP  for  longer  in  areas  that  are  more  difficult. 
Furthermore,  inference  in  certain  parts  of  the  model  may  converge  faster  than  others, 
so  we  should  focus  parameter  updates  on  the  parts  that  have  not  yet  converged.  On 
synthetic  data,  this  yields  a  significant  decrease  in  training  time. 

6.1  Dynamic  Schedules  for  Inference 

In  this  section,  I  introduce  an  efficient  dynamic  schedule  for  BP  message  updates. 
Previously,  Elidan  et  al.  proposed  a  schedule  that  propagates  the  message  whose  value 
has  changed  the  most.  I  call  this  schedule  residual  BP  with  lookahead  one  (RBP1L). 
Although  this  schedule  was  shown  to  be  often  more  effective  than  static  schedules, 
it  has  the  difficulty  that  it  determines  a  message’s  priority  by  actually  computing  it, 
which  means  that  many  message  updates  are  “wasted”,  that  is,  they  are  computed 
solely  for  the  purpose  of  computing  their  priority,  and  are  never  actually  performed. 
A significant fraction of messages computed by RBP1L are wasted in this way. The
main idea is that rather than computing the residual of each pending message update,
it is far more efficient to approximate it. Recent work [49] has examined how a message
error can be estimated as a function of its incoming errors. In our situation, the error
arises because the incoming messages have been recomputed. The arguments from
Ihler et al. apply also to the message residual, which leads to an effective method for
estimating the residual of a message, and to a dynamic schedule that is dramatically
more efficient than RBP1L.

In this section, I first describe how the message residual can be upper-bounded by
the residuals of its incoming messages (Section 6.1.2). I also describe a method for
estimating the message residual when the factors themselves change (for example, from
parameter updates), which leads to an intuitive method for initializing the residual
estimates. Then I introduce a novel message schedule, called residual BP with
lookahead zero (RBP0L) (Section 6.1.3). On several synthetic and real-world data sets, we
show that RBP0L is as much as five times faster than RBP1L but still finds the same
solution (Section 6.1.4). Finally, I examine to what extent the distance that a
message changes in a single update predicts its distance to its final converged value
(Section 6.1.4.3). I measure distance in several different ways, including the dynamic
range of the error and the Bethe energy. Surprisingly, the difference in Bethe energy
has almost no predictive value for whether a message update is nearing convergence.


6.1.1  Background 

In this section, I focus on generative models, so that we have a distribution $p(y)$
that factorizes according to an undirected factor graph $G$ with factors
$\{\Psi_a(y_a)\}_{a=1}^{A}$. This choice is simply to lighten notation, and the inference
algorithms of this section apply readily to conditional models as well. I use the
indices $a$ and $b$ to denote factors of $G$, and the indices $s$ and $t$ to denote
variables. By $\{s \in a\}$ I mean the set of all variables $s$ in the domain of the
factor $\Psi_a$, and conversely by $\{b \ni s\}$, I mean the set of all factors $\Psi_b$
that have variable $s$ in their domain.

Recall from Section 2.1.4 that belief propagation updates are given by

$$m_{as}(y_s) \leftarrow \kappa \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in a \setminus s} m_{ta}(y_t), \qquad m_{sa}(y_s) \leftarrow \kappa \prod_{b \ni s,\, b \neq a} m_{bs}(y_s), \tag{6.1}$$

which are iterated until a fixed point is reached. In the above, $\kappa$ is a normalization
constant to ensure the message sums to 1. The initial messages are set to some
arbitrary value, typically a uniform distribution.

These message updates can be written in a generic fashion as

$$m_{cd}(y_{cd}) \leftarrow \kappa \sum_{y_c \setminus y_{cd}} \Psi_c(y_c) \prod_{b \in N(c) \setminus d} m_{bc}(y_c), \tag{6.2}$$

where $c$ and $d$ may be either factors or variables, as long as they are neighbors, $N(c)$
means the set of neighbors of $c$, and $\Psi_c(y_c)$ is understood to be the identity if $c$ is a
variable. This notation abstracts over whether a message is being sent from a factor
or from a variable, which is convenient for describing message schedules.

In general, these updates may have multiple fixed points, and they are not
guaranteed to converge. Convergent methods for optimizing the Bethe energy have been
developed [145, 151], but they are not used in practice both because they tend to be
slower than iterating the messages (6.1), and because when the BP updates do not
converge, it has been observed that the Bethe approximation is bad anyway.

Now we describe in more detail how the iterations are actually performed in a BP
implementation. This level of detail will prove useful in the next section for
understanding the behavior of dynamic BP schedules. A vector $m = \{m_{cd}\}$ is maintained
of all the messages, which is initialized to uniform. Then until the messages are
converged, we iterate: A message $m_{cd}$ is selected according to the message update
schedule. The new value $m'_{cd}$ is computed from its dependent messages in $m$,
according to (6.1). Finally, the old message $(c, d)$ in $m$ is replaced with the newly computed
value $m'_{cd}$.

The important part of this description is the distinction between when a message
update is computed and when it is performed. When an update is computed,
this means that its new value is calculated according to (6.1). When an update is
performed, this means that the current message vector $m$ is updated with the new
value. Synchronous BP implementations compute all of the updates first, and then
perform them all at once. Asynchronous BP implementations almost always perform
an update as soon as it is computed, but it is possible to compute an update solely
in order to determine its priority, and not perform the update until later. As we
describe below, this is exactly the technique used by the Elidan et al. [30] schedule.
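
The distinction between computing and performing an update can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the two-state factor `psi_a`, the dictionary layout of the message vector `m`, and the helper names are hypothetical and are not taken from any implementation described in this thesis.

```python
def normalize(vec):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(vec)
    return [v / total for v in vec]


def compute_factor_to_var(table, incoming):
    """Compute (but do not perform) the message from a pairwise factor to its
    second variable: sum over the first variable of table[i][j] * incoming[i]."""
    n_out = len(table[0])
    out = [sum(table[i][j] * incoming[i] for i in range(len(table)))
           for j in range(n_out)]
    return normalize(out)


# Message vector m, keyed by (sender, receiver), initialized to uniform.
m = {("s", "a"): [0.5, 0.5], ("a", "t"): [0.5, 0.5]}
psi_a = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical factor over (y_s, y_t)

m[("s", "a")] = [0.9, 0.1]         # pretend an upstream update was performed
candidate = compute_factor_to_var(psi_a, m[("s", "a")])  # computed only
unchanged = list(m[("a", "t")])    # m still holds the old value here
m[("a", "t")] = candidate          # now the update is performed
```

Between the last two steps the value `candidate` exists but is not yet visible to any other message, which is the state in which a lookahead schedule holds its queued updates.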

6.1.2  Estimating  Message  Residuals 

In this section, I describe how to compute an upper bound on the error of a
message, which will be used as a priority for scheduling messages. I define the error
$e_{cd}(y_{cd})$ of a message $m_{cd}^{(k+1)}(y_{cd})$ as its multiplicative distance from its previous value
$m_{cd}^{(k)}(y_{cd})$, so that

$$m_{cd}^{(k+1)}(y_{cd}) = e_{cd}(y_{cd})\, m_{cd}^{(k)}(y_{cd}). \tag{6.3}$$

I define the residual of a message $m_{cd}^{(k+1)}(y_{cd})$ as the worst error over all assignments,
that is,

$$r(m_{cd}^{(k+1)}) = \max_{y_{cd}} \left| \log e_{cd}(y_{cd}) \right| = \max_{y_{cd}} \left| \log \frac{m_{cd}^{(k+1)}(y_{cd})}{m_{cd}^{(k)}(y_{cd})} \right|. \tag{6.4}$$

This corresponds to using the infinity norm to measure the distance between log
message vectors, that is, $\| \log m_{cd}^{(k+1)} - \log m_{cd}^{(k)} \|_\infty$.

An alternative error measure is the dynamic range of the error, which has been
studied by Ihler et al. [49]. This is

$$d(m_{cd}^{(k+1)}) = \max_{y_{cd},\, y'_{cd}} \log \frac{e_{cd}(y_{cd})}{e_{cd}(y'_{cd})}. \tag{6.5}$$


Later  we  compare  the  residual  and  the  dynamic  error  range  as  priority  functions  for 
message  scheduling. 
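
As a small illustration of the two error measures, the following Python sketch evaluates the residual (6.4) and the dynamic range (6.5) for messages stored as plain lists of positive numbers. The function names and the list representation are assumptions made for this example only.

```python
import math


def residual(new, old):
    """Residual (6.4): the largest absolute log-ratio between two messages."""
    return max(abs(math.log(n / o)) for n, o in zip(new, old))


def dynamic_range(new, old):
    """Dynamic range (6.5): max over pairs of log e(y) - log e(y'), which
    equals the largest log-error minus the smallest log-error."""
    log_err = [math.log(n / o) for n, o in zip(new, old)]
    return max(log_err) - min(log_err)


old_msg = [0.5, 0.5]
new_msg = [0.8, 0.2]
r = residual(new_msg, old_msg)        # |log(0.2 / 0.5)| = log 2.5
d = dynamic_range(new_msg, old_msg)   # log(1.6) - log(0.4) = log 4
```

Because the largest and smallest log-errors each lie within the residual of zero, the dynamic range computed this way never exceeds twice the residual.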

In the rest of this section, we show how to upper-bound the message errors in two
different situations: when the values of a message's dependents change, and when the
factors of the model change.

First, suppose that we have available a previously-computed message value for
$m_{cd}^{(k)}(y_d)$, so that

$$m_{cd}^{(k)}(y_d) = \kappa \sum_{y_c \setminus y_d} \Psi_c(y_c) \prod_{b \in N(c) \setminus d} m_{bc}^{(k)}(y_c), \tag{6.6}$$

and that now new messages $\{m_{bc}^{(k+1)}\}$ are available for the dependents. We wish
to upper bound the residual $r(m_{cd}^{(k+1)})$ without actually repeating the update (6.6).
Then the residual can be upper-bounded simply by the following:

$$r(m_{cd}^{(k+1)}) \le \sum_{b \in N(c) \setminus d} r(m_{bc}^{(k+1)}). \tag{6.7}$$

Proof. We show that the residual is both subadditive and contracts under the
message update, following [49]. To show subadditivity, define the message product
$M^{(k+1)}(y_c) = \prod_{b \in N(c) \setminus d} m_{bc}^{(k+1)}(y_c)$, and define $M^{(k)}$ similarly. Also, define the
residual $r(M^{(k+1)})$ as

$$r(M^{(k+1)}) = \max_{y_c} \left| \log \frac{M^{(k+1)}(y_c)}{M^{(k)}(y_c)} \right|. \tag{6.8}$$

Then we have

$$r(M^{(k+1)}) = \max_{y_c} \left| \sum_{b \in N(c) \setminus d} \log \frac{m_{bc}^{(k+1)}(y_c)}{m_{bc}^{(k)}(y_c)} \right| \le \sum_{b \in N(c) \setminus d} \max_{y_c} \left| \log \frac{m_{bc}^{(k+1)}(y_c)}{m_{bc}^{(k)}(y_c)} \right|,$$

which follows from the subadditivity of absolute value, and an increase in the degrees
of freedom of the maximization.
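
The subadditivity step can be checked numerically. The sketch below is an illustration only: it draws random positive (unnormalized) messages with a fixed seed, which is an arbitrary choice for this example, and compares the residual of the message product with the sum of the individual residuals.

```python
import math
import random


def residual(new, old):
    """Largest absolute log-ratio between two positive vectors."""
    return max(abs(math.log(n / o)) for n, o in zip(new, old))


def product(messages):
    """Elementwise product of several messages over the same assignments."""
    out = [1.0] * len(messages[0])
    for msg in messages:
        out = [o * v for o, v in zip(out, msg)]
    return out


rng = random.Random(0)


def random_message(k):
    return [rng.uniform(0.1, 1.0) for _ in range(k)]


old = [random_message(3) for _ in range(4)]   # four incoming messages, old values
new = [random_message(3) for _ in range(4)]   # the same messages after updates

lhs = residual(product(new), product(old))            # r(M^(k+1))
rhs = sum(residual(n, o) for n, o in zip(new, old))   # sum of r(m_bc^(k+1))
```

For any positive inputs `lhs` cannot exceed `rhs`, since the log-ratio of the product is the sum of the individual log-ratios.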

To  show  contraction  under  the  message  update,  we  apply  the  fact  that
This  directly  yields
(6.11)

(6.12) 

(6.13) 


□ 

Now consider the second situation, when a factor $\Psi_a$ changes. Define $e_a$ to be the
multiplicative error in the factor, so that

$$\Psi_a^{(k+1)}(y_a) = e_a(y_a)\, \Psi_a^{(k)}(y_a). \tag{6.14}$$

Suppose we have already computed a message $m_{cd}^{(k)}$, so that in the current message
vector

$$m_{cd}^{(k)}(y_d) = \kappa \sum_{y_c \setminus y_d} \Psi_c^{(k)}(y_c) \prod_{b \in N(c) \setminus d} m_{bc}^{(k)}(y_c), \tag{6.15}$$

and as before we wish to upper bound $r(m_{cd}^{(k+1)})$. Then substitution into (6.4) yields

$$r(m_{cd}^{(k+1)}) \le \max_{y_a} \left| \log \frac{\Psi_a^{(k+1)}(y_a)}{\Psi_a^{(k)}(y_a)} \right|. \tag{6.16}$$

6.1.3  Dynamic  BP  Schedules 

In this section, I describe the previously proposed schedule (RBP1L) and our new
schedule (RBP0L).

Algorithm 6.2 RBP1L [30]

function Rbp1L()
 1: m ← uniform message array
 2: q ← InitialPq()
 3: repeat
 4:    m'_bc ← Dequeue(q)
 5:    m_bc ← m'_bc   {Perform update.}
 6:    for all d in N(c) \ b do
 7:       Compute update m'_cd
 8:       Remove any pending update m'_cd from q
 9:       Add m'_cd to q with priority r(m'_cd)
10:    end for
11: until messages converged

function InitialPq()
 1: q ← empty priority queue
 2: for all messages (c, d) do   {Initialize q}
 3:    Compute update m'_cd
 4:    Add m'_cd to q with priority r(m'_cd)
 5: end for
 6: return q


6.1.3.1 Residual BP with Lookahead (RBP1L)

Elidan et al. [30] call their algorithm residual belief propagation, but in the next
section we introduce a different BP schedule that also depends on the message
residual. Therefore, to avoid confusion we refer to the Elidan et al. algorithm by the more
specific name of residual BP with lookahead one (RBP1L).

The basic idea in RBP1L (Algorithm 6.2) is that whenever a message $m_{cd}$ is
pending for an update, the message is computed and placed on a priority queue to
be performed. The priority of the message is the distance between its current value
and its newly-computed value: the exact distance measure is not specified by Elidan
et al., although they assume that it is based on a norm $\| m_{cd} - m'_{cd} \|$ of the
difference in message values. We use the residual (6.4) between log message values.

The problem with this schedule can be seen in Lines 7-9 of Algorithm 6.2. When
an update $m_{bc}$ is performed, each of its dependents $m_{cd}$ is recomputed and placed in
the queue. If a previous update $m'_{cd}$ was already pending in the queue, then that
message is discarded. I refer to this as a "wasted" update. In Section 6.1.4, we see
that this is a relatively common occurrence in RBP1L, so preventing this can yield
significant gains in convergence speed.

Algorithm 6.3 RBP0L

function Rbp0L()
 1: m ← uniform message array
 2: T ← total residuals; initialized to 0
 3: q ← InitialPq()
 4: repeat
 5:    m_bc ← Dequeue(q)
 6:    Compute update m'_bc and residual r = r(m'_bc)
 7:    m_bc ← m'_bc   {Perform update.}
 8:    For all ab, do T(ab, bc) ← 0
 9:    For all cd, do T(bc, cd) ← T(bc, cd) + r
10:    for all d in N(c) \ b do
11:       v ← Σ_a T(ac, cd)
12:       Remove any pending update (c, d) from q
13:       Add m_cd to q with priority v
14:    end for
15: until messages converged

function InitialPq()
 1: q ← empty priority queue
 2: for all messages (c, d) do   {Initialize q}
 3:    Compute update m_cd
 4:    v ← max_{y_c} |Y_c| |log Ψ_c(y_c)|
 5:    Add m_cd to q with priority v
 6: end for
 7: return q


6.1.3.2 Avoiding Lookahead (RBP0L)

In this section we present our dynamic schedule, residual BP with lookahead zero
(RBP0L). In Section 6.1.2 we saw that a residual can be upper-bounded by the sum
of its incoming residuals. The idea behind RBP0L is to use that upper bound as the
message's priority, so that an update is never computed unless it will actually be
performed. The full algorithm is given in Algorithm 6.3.




There are three fine points here. The first question is how to update the residual
estimate when a message $m_{bc}(y_c)$ is updated twice before one of its dependents $m_{cd}(y_d)$
is updated even once. In the most general case, each dependent may have actually
seen a different version of $m_{bc}$ when it was last updated. Naively applying the bound
(6.7) would suggest that we retain the version of $m_{bc}$ as it was when each of its
dependents last saw it. But this becomes somewhat expensive in terms of memory.
Instead, for each pair of messages $(b, c)$ and $(c, d)$ we maintain a total residual $T(bc, cd)$
of how much the message $m_{bc}$ has changed since $m_{cd}$ was last updated. Estimates
of the priority of $m_{cd}$ are always computed using the total residual, rather than the
single-update residual. (This preserves the upper-bound property of the residual
estimates.)

The second question is how to initialize the residual estimates. Recall that the
messages $m$ are initialized to uniform. Imagine that those initial messages were
obtained by starting with a factor graph in which all factors are uniform, running
BP to convergence, and then modifying the factors to match those in the actual graph.
From this viewpoint, the argument in Section 6.1.2 shows that an upper bound on
the residual from uniform messages is

$$r(m_{cd}) \le \max_{y_c} \left| \log \frac{\Psi_c(y_c)}{u_c(y_c)} \right|, \tag{6.17}$$

where $u_c$ is a normalized uniform factor over the variables in $y_c$. Therefore, we use
this upper bound as the initial priority of each update.

Finally, we need a way to approximate the residuals if damping is used. The
important point here is that when a message $m_{bc}$ is sent with damping, even after
the update is performed, the residual of $m_{bc}$ is nonzero, because the full update has not
been taken. To handle this, whenever a damped message $m_{bc}$ is sent, the residual
$r(m_{bc})$ is computed exactly and the message is added to the queue with that priority. (For
simplicity, this is not shown in Algorithm 6.3.)
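
To make the bookkeeping of the zero-lookahead schedule concrete, here is a minimal Python sketch. It is not the implementation used in the experiments, and it makes several simplifying assumptions that should be read as such: the model is pairwise with messages passed directly between variables, the initial priorities use a crude stand-in for (6.17) that is kept as a separate `base` term until a message is first sent, damping is omitted, and stale queue entries are skipped lazily rather than removed. All function and variable names are hypothetical.

```python
import heapq
import math


def normalize(vec):
    total = sum(vec)
    return [v / total for v in vec]


def residual(new, old):
    return max(abs(math.log(n / o)) for n, o in zip(new, old))


def rbp0l(unary, pair, tol=1e-6, max_updates=100000):
    """Zero-lookahead residual BP on a pairwise model (illustrative sketch).

    unary[s] lists positive potentials for variable s; pair[(s, t)][i][j] is
    the potential for y_s = i, y_t = j, with one entry per undirected edge.
    Returns (messages, number of updates computed)."""
    nbrs = {s: [] for s in unary}
    for s, t in pair:
        nbrs[s].append(t)
        nbrs[t].append(s)

    def psi(s, t, i, j):
        return pair[(s, t)][i][j] if (s, t) in pair else pair[(t, s)][j][i]

    def compute(s, t):
        """New value of the message s -> t from the current message vector."""
        out = []
        for j in range(len(unary[t])):
            total = 0.0
            for i in range(len(unary[s])):
                term = unary[s][i] * psi(s, t, i, j)
                for u in nbrs[s]:
                    if u != t:
                        term *= m[(u, s)][i]
                total += term
            out.append(total)
        return normalize(out)

    edges = [(s, t) for s in nbrs for t in nbrs[s]]
    m = {(s, t): normalize([1.0] * len(unary[t])) for s, t in edges}
    total_res = {}  # (antecedent, dependent) -> change since dependent was sent
    # Assumed initial estimate: largest |log| of the local potentials that
    # feed the message (a crude stand-in for the bound (6.17)).
    base = {(s, t): max(abs(math.log(unary[s][i] * psi(s, t, i, j)))
                        for i in range(len(unary[s]))
                        for j in range(len(unary[t])))
            for s, t in edges}
    pending = dict(base)
    heap = [(-v, e) for e, v in pending.items()]
    heapq.heapify(heap)
    computed = 0
    while heap and computed < max_updates:
        neg_v, edge = heapq.heappop(heap)
        if pending.get(edge) != -neg_v:
            continue                 # stale entry: its priority was replaced
        if -neg_v <= tol:
            break                    # every pending estimate is below tol
        del pending[edge]
        s, t = edge
        new = compute(s, t)          # computed only now, then performed
        r = residual(new, m[edge])
        m[edge] = new
        computed += 1
        base[edge] = 0.0
        for u in nbrs[s]:            # antecedents of this message are now seen
            if u != t:
                total_res[((u, s), edge)] = 0.0
        for w in nbrs[t]:            # dependents accumulate the new residual
            if w == s:
                continue
            dep = (t, w)
            total_res[(edge, dep)] = total_res.get((edge, dep), 0.0) + r
            v = base[dep] + sum(total_res.get(((u, t), dep), 0.0)
                                for u in nbrs[t] if u != w)
            pending[dep] = v
            heapq.heappush(heap, (-v, dep))
    return m, computed


# Usage on a three-variable chain 0 - 1 - 2 with hypothetical potentials.
unary = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [3.0, 1.0]}
pair = {(0, 1): [[2.0, 1.0], [1.0, 2.0]], (1, 2): [[2.0, 1.0], [1.0, 2.0]]}
messages, n_updates = rbp0l(unary, pair)
belief_1 = normalize([unary[1][i] * messages[(0, 1)][i] * messages[(2, 1)][i]
                      for i in range(2)])
```

On this small tree every message is computed exactly once, and the belief at the middle variable agrees with the exact marginal. The separate `base` term is a design choice of this sketch: it keeps the factor-based initial estimate from being overwritten when an antecedent is sent before the message itself has ever been computed.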
6.1.3.3  Application  to  Non-inference  Domains

RBP1L has the advantage of being more general: it can readily be applied to any
set of fixed-point equations, potentially ones that are very different than those used
in approximate inference. On the other hand, RBP0L appears to be more specific
to BP, because the residual bounds assume that BP updates are being used. For
similar algorithms, such as max-product BP and GBP, it is likely that the same
scheme would be effective. For a completely different set of fixed-point equations,
applying RBP0L would require both designing a new method for approximating the
update residuals, and designing an efficient way for initializing the residual updates.
That said, our residual estimation procedure, which simply sums up the antecedent
residuals, is fairly generic, and thus likely to perform well in a variety of domains.

6.1.4  Experiments 

In this section, we compare the convergence speed of RBP0L and RBP1L on both
synthetic and real-world graphs.

6.1.4.1 Synthetic Data

We randomly generate $N \times N$ grids of binary variables with pairwise Potts factors.
Each pairwise factor has the form

$$\Psi_{st}(y_s, y_t) \tag{6.18}$$

where the equality strength $\alpha$ is sampled uniformly from $[-C, C]$. Higher values of $C$
make inference more difficult. The unary factors have the form $\Psi_s(y_s) = [1 \;\; e^{-u_s}]$,
where $u_s$ is sampled uniformly from $[-C, C]$. We generate 50 distributions for $C = 5$.
For smaller values of $C$, inference becomes so easy that all schedules performed equally
well. For larger values of $C$, the same trend holds, but the convergence rates are
much lower. We use the grid size $N = 10$ so that exact inference is still feasible. We
measure running time by the number of message updates computed. This measure
closely matches the CPU time. Both algorithms are considered to have converged
when no pending update has a residual of greater than $10^{-3}$. The algorithms are
considered to have diverged if they have not converged after the equivalent of 1000
complete sweeps of the graph.

Figure 6.1. Convergence of RBP0L and RBP1L on synthetic 10 x 10 grids with
C = 5. The x-axis is number of messages computed (x1000). RBP0L converges faster.
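
A generator for grids of this kind might look like the sketch below. Since the right-hand side of (6.18) is not reproduced here, the pairwise table used in the code (1 on the diagonal and $e^{-\alpha}$ off the diagonal) is an assumption chosen by analogy with the unary form, not a statement of the definition used in the experiments. The function name, the integer variable numbering, and the seed handling are likewise hypothetical.

```python
import math
import random


def make_grid(n, c, seed=0):
    """Sample an n x n grid of binary variables (illustrative sketch).

    Returns (unary, pair): unary[s] = [1, exp(-u_s)] with u_s ~ U[-c, c],
    and pair[(s, t)] = [[1, exp(-a)], [exp(-a), 1]] with a ~ U[-c, c].
    The pairwise form is an assumed Potts-style table."""
    rng = random.Random(seed)
    unary, pair = {}, {}
    for row in range(n):
        for col in range(n):
            s = row * n + col
            unary[s] = [1.0, math.exp(-rng.uniform(-c, c))]
            # Connect each variable to its right and lower neighbors.
            for d_row, d_col in ((0, 1), (1, 0)):
                if row + d_row < n and col + d_col < n:
                    t = (row + d_row) * n + (col + d_col)
                    off = math.exp(-rng.uniform(-c, c))
                    pair[(s, t)] = [[1.0, off], [off, 1.0]]
    return unary, pair


unary, pair = make_grid(10, 5.0)
```

A 10 x 10 grid built this way has 100 variables and 180 pairwise factors, so one complete sweep sends 360 directed messages.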

The rates of convergence of the different schedules are shown in Figure 6.1. We see
that RBP0L converges much more rapidly than RBP1L, although both eventually
converge on the same percentage of networks.

Figure 6.2 shows the number of messages required for convergence for each sampled
model. Each integer on the x-axis represents a different randomly-generated model,
sorted by the number of messages required by RBP1L. Thus, the model at x-index
0 is the easiest model for RBP1L, and so on. Each curve is the number of messages
required, as a function of this rank. The horizontal line is the number-of-messages
cutoff, so points that exceed that line represent models for which BP did not converge.
The y-axis is logarithmic.

Figure 6.2. Updates performed by RBP0L and RBP1L on synthetic data. The
horizontal line is the number-of-messages cutoff. The y-axis is logarithmic. The
x-axis is labeled "Repetition".

Schedule    Messages sent    Accuracy
TRP         3 079 570        97.6
RBP0L         839 250        97.4
RBP1L       2 685 702        97.3

Table 6.1. Performance of BP schedules on skip-chain test data.

RBP0L computes on average half as many messages as RBP1L. RBP0L uses
fewer messages than RBP1L in 46 of the 50 sampled models. In three of the sampled
models, RBP1L converges but RBP0L does not; these appear in Figure 6.2 as the
peaks where the RBP0L curve is the only one that touches the horizontal line. In
three other models, RBP0L converges but RBP1L does not; these appear as the
valleys where RBP0L does not touch the horizontal line, but the other curves do.
The dashed curve in the figure shows the number of updates actually performed by
RBP1L. On average, 38% of the updates computed by RBP1L are never performed.
Surprisingly, RBP0L performs fewer updates than RBP1L performs; that is, it is
more efficient even if wasted updates are not counted against RBP1L. This may be a
beneficial effect of our choice of initial residual estimates.

Finally, we measure the accuracy of the marginals for RBP0L and RBP1L. For
both schedules, we measure the average per-variable KL from the exact distribution
to the BP belief. When both schedules converge, the average per-variable KL is
nearly identical: the mean absolute difference, averaged over the 50 random models,
is 0.0038.
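
The accuracy measure can be written down in a few lines. The sketch below assumes that exact marginals and BP beliefs are available as dictionaries from variable to probability list; this layout and the function names are choices made for the example, and the direction of the divergence follows the phrase "from the exact distribution to the BP belief", read here as KL(exact || belief).

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_per_variable_kl(exact, beliefs):
    """Average over variables of KL(exact marginal || BP belief)."""
    return sum(kl(exact[s], beliefs[s]) for s in exact) / len(exact)


exact = {0: [0.5, 0.5], 1: [0.5, 0.5]}      # hypothetical exact marginals
beliefs = {0: [0.5, 0.5], 1: [0.25, 0.75]}  # hypothetical BP beliefs
score = mean_per_variable_kl(exact, beliefs)
```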
6.1.4.2 Natural-Language Data

Finally, we consider a model with many irregular loops, which is the skip-chain
conditional random field introduced in Section 3.2. This model incorporates certain
long-distance dependencies between word labels into a linear-chain model for
information extraction. The resulting networks contain many loops of varying sizes, and
exact inference using a generic junction-tree solver is intractable. We evaluate on
the seminars data set described in that section. The emails on average contain 273.1
tokens, but the maximum is 3062 tokens. The emails have an average of 23.5
skip edges, but the maximum is 2260, indicating that some networks are connected
densely.

We generate networks as follows. Using ten-fold cross-validation with a 50/50
train/test split, we train a skip-chain CRF using TRP until the model parameters
converge. Then we evaluate RBP0L, RBP1L, and TRP on the test data,
measuring the number of messages sent, the running time, and the accuracy on the test
data. As in the last section, RBP0L and RBP1L are considered to have converged
if no pending update has a residual of more than $10^{-3}$. TRP is considered to have
converged if no update performed on the previous iteration resulted in a residual of
greater than $10^{-3}$. In all cases, the trained model parameters are exactly the same;
the inference algorithms are varied only at test time, not at training time.

Table 6.1 shows the performance of each of the message schedules, averaged over
the 10 folds. RBP0L uses one-third as many messages as RBP1L, and one-fifth of the
CPU time, but has essentially the same accuracy. Also, RBP0L uses 27% of the
messages used by TRP.

In our implementation, the CPU time required per message update is much higher
for the RBP schedules than for TRP. The total running time for RBP0L is 66s,
compared to 110s for TRP and 321s for RBP1L. This is partially because of the
overhead in maintaining the priority queues and residual estimates, but also this
is because our TRP implementation is a highly optimized one that we have used in
much previous work, whereas our RBP implementations have more room for low-level
optimization.

Figure 6.3. Comparison of error metrics in predicting the distance to convergence.
(See text for explanation.) Panel x-axis labels include KL (old, new) and
logZ_Bethe (new) - logZ_Bethe (old).

6.1.4.3 Error Estimates

The message residual is an intuitive error measure to use for scheduling, but
there are many others that are conceivable. In this section, I compare different error
measures to evaluate how reliable they are at predicting the next message to send.
Ideally, we would evaluate a priority function for messages by whether sending
higher-priority messages first actually reduces the computation time required for
convergence. But it is extremely difficult to compute this, so we instead measure the
distance to the converged message values, as follows.

We generate a synthetic grid as in Section 6.1.4.1. (The graphs here are from
a single sampled model, but different samples result in qualitatively similar results.)
Then, we run RBP0L on the grid to convergence, yielding a set of converged messages
$m^*$. Finally, we run RBP0L again on the same grid, without making use of $m^*$. After
each message update $m_{cd}^{(k)} \mapsto m_{cd}^{(k+1)}$ of RBP0L, we measure:

a. The residual of the errors $e(m_{cd}^{(k)}, m_{cd}^{(k+1)})$, $e(m_{cd}^{(k)}, m^*_{cd})$, and $e(m_{cd}^{(k+1)}, m^*_{cd})$;

b. The dynamic range of the same errors;

c. The KL divergences $KL(m_{cd}^{(k)} \| m_{cd}^{(k+1)})$, $KL(m_{cd}^{(k)} \| m^*_{cd})$, and $KL(m_{cd}^{(k+1)} \| m^*_{cd})$;

d. The change in Bethe energy $\log Z_{BP}(m^{(k+1)}) - \log Z_{BP}(m^{(k)})$.

Thus we can measure how well each of the error metrics predicts the distance to
convergence $r(e(m_{cd}^{(k+1)}, m^*_{cd})) - r(e(m_{cd}^{(k)}, m^*_{cd}))$. This is shown in Figure 6.3. Each plot
in that figure shows a different distance measure between messages: from top left, they
are message residual, error dynamic range, KL divergence, and difference in Bethe
energy. Each point in the figures represents a single message update. In all figures,
the x-axis shows the distance between the message $m_{cd}^{(k)}$ at the previous iteration and
the value $m_{cd}^{(k+1)}$ at the current iteration. The y-axis shows the change in distance
to the converged messages, that is, how much closer the update at $k + 1$ brought
the message to its converged value. We measure this as the difference between the
residuals of $e(m_{cd}^{(k+1)}, m^*_{cd})$ and $e(m_{cd}^{(k)}, m^*_{cd})$. Negative values of this measure are better,
because they mean that the distance to the converged messages has decreased due to
the update. An ideal graph would be a line with negative slope.

Both the message residual and the dynamic error range display a clear
upper-bounding property on the absolute value. Also, the points are somewhat clustered
along the diagonal, indicating some kind of a linear relationship between the
single-message distance and the distance to convergence. The single-message distance does
not seem to do well, however, at predicting in which direction the message will change,
that is, closer or farther from its converged value. Qualitatively, the residual and the
error range seem to perform similarly at predicting the distance to the converged
messages, but in preliminary experiments, using the error range in a one-lookahead
schedule seemed to converge slightly slower than using the residual.

The message KL also seems to do a poor job of predicting the distance to the
converged message. More surprisingly, the difference in Bethe energy is almost
completely uninformative about the distance to converged messages. This suggests an
intriguing explanation of the slow convergence of gradient methods for optimizing the
Bethe energy: perhaps the objective function itself is simply not good at measuring
what we care about. It is possible that the Bethe approximation may be accurate
at convergence but still not be accurate outside of the constraint set, that is, when
the messages are not locally consistent. This is precisely the situation that occurs
during message scheduling. For this reason, it may be more revealing to look at the
Lagrangian of the Bethe energy rather than the objective function itself.

6.2  Dynamic  Schedules  for  Inference  and  Learning 

In this section, I describe dynamic schedules for the combined inference and learning
problem. Learning algorithms for structured models, such as maximum likelihood,
max-margin methods [25, 129], and search-based approaches [28], all require
inference  across  a  set  of  labeled  training  examples,  which  is  then  repeated  for  many 
settings  of  the  model  parameters.  This  repeated  inference  is  intractable  for  general 
models  and  can  be  expensive  even  for  tractable  models  when  the  training  set  is  large. 

Most practical methods for learning structured models can be abstractly described
as follows. The object is to optimize some loss function $\mathcal{L}(\theta)$ with respect to the
model parameters $\theta$. The parameters are updated iteratively, so that at iteration $t$, we
take the current parameter setting $\theta^{(t)}$ and, based on the training data, compute an
update direction $\Delta\theta$, and set $\theta^{(t+1)} \leftarrow \theta^{(t)} + \alpha \Delta\theta$, where $\alpha$ is some step size. For
example, in maximum likelihood training, $\mathcal{L}$ is the likelihood, and $\Delta\theta$ the likelihood
gradient with respect to $\theta$. Often, computing $\Delta\theta$ will be intractable in general, so some
approximation is used, such as a variational approximation for the likelihood, or a
best-first search for a max-margin method.

This leads to two observations. First, not only is the parameter estimation algorithm
iterative, but often the inference algorithm is iterative as well. Iterative
inference algorithms include variational methods such as mean field and belief
propagation, and search algorithms such as simulated annealing. The second observation is
that some areas of the model may be more difficult than others for training or inference.
If a certain area of the model is easy to train, we could save time by “locking”
its  parameters  and  focusing  computational  effort  on  the  more  difficult  areas  of  the 
model. 

In  this  section,  we  exploit  both  of  these  observations  by  defining  a  single  set  of 
fixed-point  updates  that  integrate  inference  and  learning.  We  focus  on  the  case  of 
training  loopy  Markov  random  fields  using  belief  propagation  (BP).  We  view  the  BP 
message  updates  and  the  likelihood  gradient  updates  as  a  single  set  of  fixed  point 
equations,  which  we  are  free  to  iterate  according  to  any  schedule.  Ordinarily,  these 
are  scheduled  by  running  the  belief  equations  to  convergence,  using  those  beliefs 
to  compute  an  approximate  gradient,  and  taking  a  step  in  that  direction.  But  more 
efficient  schedules  are  possible.  In  particular,  dynamic  schedules  provide  an  especially 
great  amount  of  flexibility: 

•  First,  a  dynamic  scheduler  can  choose  to  make  gradient  updates  before  the 
inference  updates  have  converged,  thus  providing  a  way  to  perform  training 
using  early  stopping  of  BP. 

•  Second, just as a schedule can prioritize a message if its incoming messages
change  greatly,  it  can  also  prioritize  a  message  whose  local  factor  has  changed 
greatly  due  to  a  parameter  update.  Thus,  inference  can  focus  on  the  areas  of 
the  model  in  which  the  parameters  are  changing  most  rapidly.  In  particular,  a 
dynamic  schedule  is  free  to  ignore  regions  of  the  model  for  which  training  has 
essentially  converged. 

•  Finally,  if  the  training  set  consists  of  iid  examples,  it  can  be  presented  to  the 
scheduler  as  a  single,  disconnected  graphical  model.  This  means  that  when 
the  scheduler  is  choosing  which  area  of  the  model  to  perform  inference  in,  it 
is also choosing which training instance to perform inference in. Because gradient
updates can occur before performing inference on the entire training set,
the  resulting  method  is  similar  to  stochastic  gradient  descent,  except  that  the 
scheduler  chooses  the  batch  compositions  automatically. 

Also,  Teh  and  Welling  [132]  have  previously  proposed  an  algorithm  that  integrates 
message  updates  and  updates  from  iterative  scaling.  Since  iterative  scaling  can  also 
be  used  for  parameter  estimation  in  Markov  random  fields,  their  technique  is  in  the 
same spirit as ours. Iterative scaling has repeatedly been shown to converge much
more slowly than gradient updates for parameter estimation [65, 79, 112, 143], so a method
that  uses  gradient  updates  has  the  potential  to  be  a  significant  improvement. 

In this section, I present an integrated system of equations for gradient and BP
updates of Markov random fields (Section 6.2.1), and describe how the updates
may be iterated using a dynamic schedule. Then, we show that the integrated schedule
converges significantly faster than running BP to convergence (Section 6.2.2),
resulting in an almost three-fold decrease in training time with equivalent likelihood.

6.2.1  Combining  Inference  and  Learning

Let $p(y)$ factorize according to a factor graph with factors $\{\Psi_a\}$, where as usual
each $\Psi_a$ has the exponential form $\Psi_a(y_a) = \exp\{\sum_k \theta_{ak} f_{ak}(y_a)\}$. We wish to estimate
the parameters given data with empirical distribution $\tilde{p}$. In Section 3.1.4, we saw that
the gradient of the BP likelihood is

$$\frac{\partial \mathcal{L}_{BP}}{\partial \theta_a} = g_a(\theta_a) = \sum_{y_a} \tilde{p}(y_a) f_a(y_a) - \sum_{y_a} q_a(y_a) f_a(y_a). \tag{6.19}$$


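Also, we have seen that the beliefs are computed by finding fixed points of the system
of equations

$$m_{as}(y_s) \leftarrow \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in N(a) \setminus s} m_{ta}(y_t) \tag{6.20}$$

$$m_{sa}(y_s) \leftarrow \prod_{b \in N(s) \setminus a} m_{bs}(y_s) \tag{6.21}$$

upon which the beliefs are calculated as

$$q_a(y_a) = \Psi_a(y_a) \prod_{s \in a} m_{sa}(y_s) \tag{6.22}$$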

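Thus, inference and learning can be viewed as attempting to find a fixed point of
a single system of equations. This system is

$$\begin{aligned}
m_{as}(y_s) &\leftarrow \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in N(a) \setminus s} m_{ta}(y_t) \\
m_{sa}(y_s) &\leftarrow \prod_{b \in N(s) \setminus a} m_{bs}(y_s) \\
q_a(y_a) &\leftarrow \Psi_a(y_a) \prod_{s \in a} m_{sa}(y_s) \\
\theta_a &\leftarrow \theta_a + \alpha\, g_a(\theta_a)
\end{aligned} \tag{6.23}$$

where $\alpha$ is a step size, and $g_a$ is the approximate gradient (6.19).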
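
The message updates (6.20) and (6.21) and the belief (6.22) can be illustrated on a toy model. The following is a minimal sketch, not the implementation used in the experiments: the three-variable chain, the factor tables, and all function names are hypothetical, and the beliefs are normalized explicitly. Because the toy graph is a tree, the BP beliefs should agree with the exact factor marginals, which the brute-force routine lets one check.

```python
import itertools

# Hypothetical toy model: a chain y0 - a - y1 - b - y2 of binary variables.
# Each factor is (scope, table), with the table indexed by the scope's values.
factors = {
    "a": ((0, 1), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
    "b": ((1, 2), {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}),
}
variables = [0, 1, 2]
neighbors = {s: [a for a, (scope, _) in factors.items() if s in scope]
             for s in variables}


def update_factor_to_var(a, s, m_sf):
    """Eq. (6.20): sum out every variable of factor a except s."""
    scope, table = factors[a]
    out = [0.0, 0.0]
    for ya, psi in table.items():
        term = psi
        for t, yt in zip(scope, ya):
            if t != s:
                term *= m_sf[(t, a)][yt]
        out[ya[scope.index(s)]] += term
    return out


def update_var_to_factor(s, a, m_fs):
    """Eq. (6.21): product of messages into s from all factors except a."""
    out = [1.0, 1.0]
    for b in neighbors[s]:
        if b != a:
            out = [out[v] * m_fs[(b, s)][v] for v in (0, 1)]
    return out


def belief(a, m_sf):
    """Eq. (6.22), normalized so that the belief sums to one."""
    scope, table = factors[a]
    q = {}
    for ya, psi in table.items():
        val = psi
        for s, ys in zip(scope, ya):
            val *= m_sf[(s, a)][ys]
        q[ya] = val
    z = sum(q.values())
    return {ya: v / z for ya, v in q.items()}


def run_bp(sweeps=5):
    """Synchronous-style sweeps; a handful suffices on this small tree."""
    m_sf = {(s, a): [1.0, 1.0] for s in variables for a in neighbors[s]}
    m_fs = {(a, s): [1.0, 1.0] for s in variables for a in neighbors[s]}
    for _ in range(sweeps):
        for (a, s) in m_fs:
            m_fs[(a, s)] = update_factor_to_var(a, s, m_sf)
        for (s, a) in m_sf:
            m_sf[(s, a)] = update_var_to_factor(s, a, m_fs)
    return {a: belief(a, m_sf) for a in factors}


def exact_marginal(a):
    """Brute-force marginal over the scope of factor a, for comparison."""
    scope, _ = factors[a]
    marg = {}
    for y in itertools.product((0, 1), repeat=len(variables)):
        w = 1.0
        for sc, table in factors.values():
            w *= table[tuple(y[v] for v in sc)]
        key = tuple(y[v] for v in scope)
        marg[key] = marg.get(key, 0.0) + w
    z = sum(marg.values())
    return {k: v / z for k, v in marg.items()}
```

On a loopy graph the same updates apply, but the beliefs are then only approximations and the sweep order becomes a scheduling choice.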

Algorithm 6.4  IRBP

function Irbp($G$)
    $q \leftarrow$ InitialPq($G$)
    repeat
        $v_{\mathrm{lhs}} \leftarrow$ Dequeue($q$)
        Perform update indicated by $v_{\mathrm{lhs}}$
        for all equations $e$ that depend on $v_{\mathrm{lhs}}$ do
            Add($q$, $e$)
        end for
    until updates converged

function InitialPq($G$)
    $q \leftarrow$ empty priority queue
    Add($q$, $m_{sa}$) $\forall (s, a) \in G$
    Add($q$, $m_{as}$) $\forall (s, a) \in G$
    Add($q$, $\Psi_a$) $\forall a \in G$
    return $q$

function Add($q$, $v_{\mathrm{lhs}}$)
    $m \leftarrow$ current value of variable $v_{\mathrm{lhs}}$
    $m' \leftarrow$ new value of $v_{\mathrm{lhs}}$ computed from its associated equation
    Add $v_{\mathrm{lhs}}$ to $q$ with priority $\|m' - m\|$


This is a system of fixed-point equations, in which the variables are $V = \{m_{as}\} \cup
\{m_{sa}(y_s)\} \cup \{q_a(y_a)\} \cup \{\theta_a\}$. In general, a fixed point of this system can be computed
using a dynamic propagation schedule, by maintaining a priority queue of pending
updates, and performing first the updates that would cause the greatest residual. Let
$v_{\mathrm{lhs}}$ denote some variable on the left-hand side of one of the update equations (6.23).
Let $v^{(t)}$ be the value of some variable $v \in V$ at some iteration $t$. Then we define the
residual of the update at iteration $t$ as $\|v_{\mathrm{lhs}}^{(t+1)} - v_{\mathrm{lhs}}^{(t)}\|$, where $\|\cdot\|$ is a norm.
We use the infinity norm, so that the residual is the maximum absolute change in any component.

However, the equations (6.23) are in fact not amenable to dynamic scheduling.
The reason for this is that the gradient update on $\theta$ does not decrease its residual.
Once a set of messages is fixed, the gradient update is linear in $\theta$, and so has no
maximum. There is nothing to stop an uninformed schedule from iterating the gradient
update infinitely.
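
The self-loop can be seen numerically. The sketch below assumes a single factor with indicator features, so that the gradient (6.19) reduces to the difference between the empirical distribution and the belief; the function names and numbers are hypothetical. With the belief held fixed, every repetition of the parameter update has exactly the same residual, so a residual-driven scheduler would never see a reason to stop selecting it.

```python
def gradient(p_emp, q):
    """Gradient (6.19) under indicator features: g(y) = p_emp(y) - q(y)."""
    return [p - b for p, b in zip(p_emp, q)]


def frozen_belief_residuals(theta, p_emp, q_fixed, alpha, steps):
    """Apply theta <- theta + alpha * g repeatedly while the belief stays fixed.

    Returns the final parameters and the infinity-norm residual of each update.
    """
    residuals = []
    for _ in range(steps):
        g = gradient(p_emp, q_fixed)  # never changes: q_fixed is not recomputed
        new_theta = [th + alpha * gi for th, gi in zip(theta, g)]
        residuals.append(max(abs(n - o) for n, o in zip(new_theta, theta)))
        theta = new_theta
    return theta, residuals
```

For example, with `p_emp = [0.7, 0.3]`, `q_fixed = [0.5, 0.5]`, and `alpha = 0.1`, each update moves the parameters by the same amount, and the parameters grow without bound as the number of steps increases.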

We can avoid this problem in two ways. First, adding regularization to the objective
penalizes large parameter values directly. Second, we can cast the gradient
update in a different way. A descent step along the gradient (6.19) can be viewed
as updating each factor $\Psi_a$ by:

$$\begin{aligned}
\Psi_a(\cdot\,; \theta_a) &\leftarrow \Psi_a(\cdot\,; \theta_a + \alpha\, g_a(\theta_a)) && (6.24) \\
&= \Psi_a\Big(\cdot\,; \theta_a + \alpha \sum_{y_a} \big[\tilde{p}(y_a) - q_a(y_a)\big] f_a(y_a)\Big) && (6.25)
\end{aligned}$$


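This leads to the system of equations

$$\begin{aligned}
m_{as}(y_s) &\leftarrow \sum_{y_a \setminus y_s} \Psi_a(y_a) \prod_{t \in N(a) \setminus s} m_{ta}(y_t) \\
m_{sa}(y_s) &\leftarrow \prod_{b \in N(s) \setminus a} m_{bs}(y_s) \\
\Psi_a(\cdot\,; \theta_a) &\leftarrow \Psi_a\Big(\cdot\,; \theta_a + \alpha \sum_{y_a} \Big[\tilde{p}(y_a) - \Psi_a(y_a) \prod_{s \in a} m_{sa}(y_s)\Big] f_a(y_a)\Big)
\end{aligned} \tag{6.26}$$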


The main advantage to this viewpoint is that whenever two gradient updates are
made in a row, the factor resulting from the new parameters is used to recompute
the belief. This breaks the self-loop, because now altering $\Psi_a$ causes $q_a$ to better
match the empirical belief, and thus the gradient to decrease. This also helps to
prevent the factor updates from being selected too often.
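
A small numerical sketch of why recomputing the belief breaks the self-loop follows. It makes simplifying assumptions that are not part of the method itself: a single binary factor with indicator features and uniform incoming messages, so that the belief is simply the normalized factor. The function names are hypothetical. Because each factor update changes the belief, consecutive updates have shrinking residuals, and the belief approaches the empirical distribution.

```python
import math


def belief_from_theta(theta):
    """Belief of an isolated factor with indicator features: q(y) ~ exp(theta_y)."""
    z = sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta]


def recomputed_belief_residuals(theta, p_emp, alpha, steps):
    """Factor update in the style of (6.26): the belief is rebuilt from the
    current parameters before every gradient step.

    Returns the final parameters and the infinity-norm residual of each update.
    """
    residuals = []
    for _ in range(steps):
        q = belief_from_theta(theta)  # depends on the latest theta
        new_theta = [t + alpha * (p - b) for t, p, b in zip(theta, p_emp, q)]
        residuals.append(max(abs(n - o) for n, o in zip(new_theta, theta)))
        theta = new_theta
    return theta, residuals
```

Starting from `theta = [0.0, 0.0]` with `p_emp = [0.8, 0.2]` and `alpha = 1.0`, the residuals decrease at every step, so the priority of the factor update falls and a dynamic schedule can turn its attention elsewhere.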

Now  this  system  of  equations  (6.26)  can  be  iterated  to  find  a  fixed  point.  Any 
fixed  point  corresponds  to  a  saddlepoint  of  the  BP  likelihood.  In  fact,  this  system 
of equations generalizes a typical method of training using loopy BP. Typically [e.g.,
128], we run the BP updates to convergence, then take a gradient step using a
second-order method, and continue until convergence; this can be seen as a particular
schedule for solving the system (6.26). Running BP to convergence may be wasteful,
however: over several gradient steps, the beliefs may remain stable in some areas of
the model, so that rerunning BP to convergence in those parts of the model accomplishes
little. An alternative idea is to allow taking gradient steps before the BP steps
have converged; that is, to schedule the equations (6.26) as a single set of fixed-point
updates, in which the scheduler makes no distinction between gradient updates and
BP updates. This is the approach that we take here.

Running  BP  to  convergence  is  one  schedule  for  propagating  the  updates  (6.26), 
but other schedules may be more efficient. Recently, it has been shown that dynamic
schedules for belief propagation, in particular residual belief propagation (RBP),
which propagates messages with the largest residual first, can lead to a great
improvement in inference time [30]. But this technique can be applied to any system
of fixed-point equations, and here we apply it to the integrated inference/estimation
system of equations (6.26). The potential advantage of this schedule is that the number
of inference updates and gradient updates can be varied in different portions of
the model, and in different regions of parameter space. We call this method integrated
residual belief/gradient propagation (IRBP). It is described in detail in Algorithm 6.4.
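
The scheduling idea applies to any system of fixed-point equations, which the following sketch tries to make concrete. It is an illustration under stated assumptions rather than the IRBP implementation: variables are scalars instead of message vectors or factors, the names `residual_schedule`, `equations`, and `depends_on` are hypothetical, and queued priorities are recomputed lazily when an entry is popped instead of being updated in place.

```python
import heapq


def residual_schedule(values, equations, depends_on, tol=1e-10, max_updates=10_000):
    """Iterate fixed-point equations, preferring the update with largest residual.

    values:     dict mapping a variable name to its current (float) value
    equations:  dict mapping a name to a function of `values` that returns
                the new value for that name
    depends_on: dict mapping a name to the names whose equations read it
    """
    def residual(name):
        return abs(equations[name](values) - values[name])

    # heapq is a min-heap, so priorities are stored as negated residuals.
    heap = [(-residual(name), name) for name in values]
    heapq.heapify(heap)
    updates = 0
    while heap and updates < max_updates:
        _, name = heapq.heappop(heap)
        if residual(name) < tol:  # the queued priority may be stale
            continue
        values[name] = equations[name](values)
        updates += 1
        # Every equation that reads the updated variable gets a fresh priority.
        for dep in depends_on[name]:
            heapq.heappush(heap, (-residual(dep), dep))
    return values, updates
```

A usage example on a two-variable linear system whose fixed point is x = y = 2:

```python
values, n_updates = residual_schedule(
    {"x": 0.0, "y": 0.0},
    {"x": lambda v: 0.5 * v["y"] + 1.0, "y": lambda v: 0.5 * v["x"] + 1.0},
    {"x": ["y"], "y": ["x"]},
)
```

In the integrated setting, the entries of `values` would be the messages and factors of (6.26), and the scheduler would make no distinction between the two kinds of update.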


6.2.2  Experiments 

In  this  section,  we  compare  the  convergence  speed  of  IRBP  to  a  gradient  update 
in  which  BP  is  run  to  convergence.  We  randomly  generate  N  x  N  grids  of  binary 
variables  with  pairwise  Potts  factors.  Each  pairwise  factor  has  the  form 


$$\Psi_{ij}(y_s, y_t) = \cdots \tag{6.27}$$


where the equality strength $\alpha$ is sampled uniformly from $[-C, C]$. Higher values of
$C$ mean that the constraints are stronger on average. The unary factors have the
form $\Psi_s(y_s) = [1 \;\; e^{-u_s}]$, where $u_s$ is sampled uniformly from $[-C, C]$. I generate 50
distributions using $C = 4$. I use the grid size $N = 10$. For each sampled model, we
generate 100 training examples. I use a Gaussian prior on parameters with variance
$\sigma^2 = 10$.

Figure 6.4. Approximate likelihood of final IRBP parameters compared to
BFGS/TRP on generative synthetic model. The red line is $y = x$.

I compare the IRBP schedule to parameter estimation using BFGS, with the
marginals computed by running TRP to convergence (BFGS/TRP). TRP has previously
been shown to outperform synchronous and naive asynchronous BP schedules
[30, 139]. IRBP converges dramatically faster than BFGS/TRP. Figure 6.5 shows
the running time measured by floating point operations for each sampled model: on
average BFGS/TRP requires 8 times as many floating point operations. The y-axis
in Figure 6.5 shows the ratio of BFGS/TRP flops to IRBP flops. Figure 6.6 shows the
CPU time used for each sampled model. Here the y-axis shows the ratio of running
times of BFGS/TRP to IRBP. On average, BFGS/TRP uses 2.75 times as much CPU time
as IRBP. It is interesting that the speed difference is much more pronounced when
measured in floating-point operations than wall-clock time. This indicates that the
overhead of maintaining the update queue and the residuals is a significant portion of
the IRBP running time. (Our TRP algorithm has had a significant amount of low-level
optimization applied, while our IRBP implementation has not.)

Figure 6.5. Running time of IRBP and BFGS/TRP on generative synthetic
model, measured in number of floating-point operations (flops). The red line
is $y = 1$, which occurs when the two are equally fast.

In  Figure  6.4,  we  show  the  training  likelihood  of  the  final  parameter  settings 
found  by  IRBP  and  BFGS/TRP.  Because  both  schedules  use  BP,  in  both  cases  we 
approximate  the  likelihood  using  the  Bethe  energy.  Both  methods  find  equally  good 
parameter  settings. 

Now, previous work has shown that a residual-based schedule for BP can greatly
outperform TRP [30]. So it may be objected that the observed decrease in training
time is due to using a better BP schedule, rather than interleaving the BP and gradient
updates. To ensure that this is not the case, we measure the training time for simple
gradient descent, where the gradient is approximated by running RBP to convergence
(GD/RBP). To be clear, in GD/RBP, we use the residual schedule for the BP updates
only, and never for the gradient updates. This schedule runs significantly slower than
IRBP: the average training time for GD/RBP is 28.2s, while the average for IRBP is
10.4s.

Figure 6.6. Running time of IRBP and BFGS/TRP on generative synthetic
model, measured in CPU time (s).

6.2.3  Related  Work 

The closest work to ours is the unified propagation and scaling algorithm [132],
which was introduced as a method for minimizing KL divergence from a reference
distribution to an approximating family, which is a generalization of the maximum
likelihood problem. That algorithm is essentially a schedule for BP updates on the
messages and iterative scaling updates on the model parameters. Thus, it can be
seen as a kind of integrated inference and learning. However, it is well known that
iterative scaling updates converge much more slowly than gradient updates for undirected
parameter estimation [65, 79, 112, 143].

Bayesian methods, of course, make no distinction between inference and learning.
Although theoretically attractive, integrating over the parameters becomes very difficult
when the other variables in the model have complex connections of their own. For
this reason, there has been very little work in Bayesian training of Markov random
fields. Exceptions include Murray and Ghahramani [85] and Qi et al. [95].

For  the  fully-observed  generative  models  that  we  consider  here,  parameters  that 
maximize the BP likelihood may be found more quickly by pseudo-moment matching
[142]. This does not make our techniques obsolete, however. The pseudo-moment
matching estimator is unavailable when there are latent variables, or when the factors
have a restricted form, such as if they are restricted to a continuous exponential family,
or  if  some  parameters  are  tied.  These  situations  often  arise  in  practical  models,  so 
learning  methods  that  compute  the  BP  updates  are  still  relevant. 

6.3  Conclusion 

In this chapter, I have explored dynamic schedules for message updates, both for
the inference-only problem and the combined inference-and-learning problem. For
the inference-only problem, I have presented RBP0L, a new dynamic schedule for
belief propagation that schedules messages based on an upper bound on their residual.
On both synthetic and real-world data, RBP0L converges faster than both RBP1L,
a recently proposed dynamic schedule, and TRP, with comparable accuracy. It
would be interesting to explore whether the residual estimation technique in RBP0L
is equally effective for other inference algorithms, such as EP or GBP, or whether the
residual estimation technique would require significant adaptation. In continuous
spaces, it may be that the message residual itself is not a good measure for scheduling,
because it gives equal weight to all areas of the domain, even those with low
probability. The KL divergence may be more appropriate.

For the combined inference-and-learning problem, I have presented suggestive results
that combining inference and learning into a single system of equations can lead to
significant speed-ups in training time for undirected models, resulting in an almost
three-fold improvement in training time with no loss in likelihood. Typically, training
is performed by running an approximate inference algorithm such as BP to convergence,
then using the resulting approximate marginals to approximate the gradient. But this
is only one schedule for a more general system of equations, and I show that
residual-based schedules can perform significantly faster. Of course, this leaves open whether
this schedule would work as well on larger-scale networks, such as conditional random
fields. Unfortunately, the heavy parameter tying that is typical in CRFs makes
application of IRBP difficult.

CHAPTER  7


FUTURE  DIRECTIONS 


In  this  chapter,  I  mention  some  research  directions  in  the  area  of  approximate 
CRF  training  that  may  be  useful  for  future  work. 

7.1  Bigger  Pieces 

A  natural  question  is  how  readily  the  methods  here  extend  to  the  case  where 
pieces are treated as regions that are larger than a single factor. The difficulty seems
to  lie  in  how  to  handle  the  overlaps  among  larger  pieces.  But  the  connection  to  the 
Bethe  energy  is  illuminating  here.  Just  as  the  factor-as-piece  likelihood  arises  from 
the  dual  Bethe  energy  with  uniform  messages,  larger  pieces  can  potentially  be  handled 
by  using  uniform  messages  in  a  more  general  free  energy,  such  as  region  graph  free 
energies  [150].  That  said,  I  do  not  believe  that  the  models  introduced  in  this  thesis 
are  complex  enough  to  benefit  from  larger  pieces. 

Another  question  then  becomes  how  to  choose  the  pieces  when  they  are  larger 
than  a  single  factor.  Quite  possibly  the  application  itself  would  suggest  a  choice  of 
pieces. There are also a few minimal consistency criteria that have been proposed, in
particular maxent-normality [150] and non-singularity [146], but these, while useful,
are fairly general, and do not place severe constraints on the region choice. Finally,
there is at least one method in the literature for choosing regions in generalized BP
[144]; however, this method chooses the regions based on properties of the entire
distribution, that is, on the model parameters. It seems unwise to use the model
parameters to choose the objective function that they are selected to optimize, so this
technique seems inapplicable. There is also a possible connection between methods
for selecting pieces and feature selection methods.


7.2  Pseudolikelihood 

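Recall from Chapters 4 and 5 that pseudolikelihood [8] is defined as

$$\begin{aligned}
\ell_{PL}(\theta) &= \sum_s \log p(y_s \mid y_{N(s)}, x) && (7.1) \\
&= \sum_s \log \frac{\prod_{a \ni s} \Psi_a(y_s, y_{N(s)}, x_a)}{\sum_{y'_s} \prod_{a \ni s} \Psi_a(y'_s, y_{N(s)}, x_a)} && (7.2)
\end{aligned}$$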

Although  on  simple  synthetic  data  sets,  pseudolikelihood  tends  to  perform  well  [90], 
on  our  benchmark  NLP  data  it  performs  fairly  poorly.  This  phenomenon  warrants 
an  explanation. 

There exists a simple class of data sets in which pseudolikelihood performs pathologically.
As an example, consider a two-node Boltzmann machine with binary variables.
That is, each variable has a unary factor

$$\Psi_s(y_s) = \begin{bmatrix} 1 & e^{\theta_s} \end{bmatrix} \tag{7.3}$$

and there is one binary factor

$$\Psi(y_0, y_1) = \begin{bmatrix} 1 & e^{\alpha_{ij}} \\ e^{\alpha_{ij}} & 1 \end{bmatrix} \tag{7.4}$$


Assume the data set contains two observations: (0, 0) and (1, 1). Then the maximum
pseudolikelihood estimate becomes senseless: maximizing it will attempt to make the
conditional distributions deterministic, with complete disregard for the per-variable
marginals $p(y_0)$ and $p(y_1)$.
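
This degenerate behavior can be checked numerically. The sketch below assumes one particular reading of the factor tables (7.3) and (7.4): the unary factor takes the value 1 for y = 0 and exp(theta) for y = 1, and the binary factor takes the value 1 when the two variables agree and exp(alpha) when they disagree. The function name and parameter names are hypothetical. Under that reading, the log pseudolikelihood of the two observations increases toward its supremum of 0 as alpha decreases, whatever values the unary parameters take.

```python
import math


def log_pseudolikelihood(theta0, theta1, alpha, data):
    """Log pseudolikelihood of a two-node binary model.

    Assumed factor forms (one reading of (7.3) and (7.4)):
      unary:  1 if y == 0, exp(theta) if y == 1
      binary: 1 if y0 == y1, exp(alpha) otherwise
    """
    def unary(theta, y):
        return math.exp(theta) if y == 1 else 1.0

    def pair(y0, y1):
        return 1.0 if y0 == y1 else math.exp(alpha)

    total = 0.0
    for y0, y1 in data:
        # log p(y0 | y1)
        num = unary(theta0, y0) * pair(y0, y1)
        den = sum(unary(theta0, v) * pair(v, y1) for v in (0, 1))
        total += math.log(num / den)
        # log p(y1 | y0)
        num = unary(theta1, y1) * pair(y0, y1)
        den = sum(unary(theta1, v) * pair(y0, v) for v in (0, 1))
        total += math.log(num / den)
    return total
```

Evaluating the function on `data = [(0, 0), (1, 1)]` for a decreasing sequence of `alpha` values shows the objective rising monotonically, and at a strongly negative `alpha` it is close to 0 for quite different choices of `theta0` and `theta1`, which is the sense in which the marginals are disregarded.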

One way to understand this is that pseudolikelihood attempts to match the conditional
distributions of the model to the conditional distributions of the data. But
the conditional distributions given by the data need not determine a unique joint
distribution; they do so only if the Gibbs sampler that they define is ergodic. This is not
the case in the example above. It is unlikely, however, that this phenomenon explains
the results that we see on the real-world data, because although pseudolikelihood
performs significantly worse than piecewise and maximum likelihood, it
does not appear to be as pathologically bad as it would if this phenomenon were the
culprit.

A different potential explanation is that pseudolikelihood may perform poorly in
the presence of data sparsity, by which I mean that because there are a large number
of input features, and the cardinality of the output variables is large, each input-output
combination is observed only a few times, if at all. Sparsity is a ubiquitous
feature of NLP applications, even when hundreds of thousands of words of labeled
data are available. The reason this situation may be bad for pseudolikelihood is that
if a neighborhood configuration $(y_s, y_{N(s)})$ never occurs in the training data, then the
objective function makes no attempt to set the model marginal of that configuration
to 0.

One way to address this problem is to add a penalty to the pseudolikelihood that
attempts  to  force  such  configurations  to  have  zero  probability.  This  is  potentially  a 
very  fruitful  approach  in  practice. 

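Another potential set of techniques to address this problem results from the following
viewpoint. The pseudolikelihood gradient is given by

$$\frac{\partial \ell_{PL}}{\partial \theta_{ak}} = \sum_{\Psi_a \in G} \sum_k d_a f_{ak}(y_a, x_a) - \sum_{\Psi_a \in G} \sum_k \sum_{s \in a} \sum_{y'} f_{ak}(y'_a, x_a)\, p(y'_s \mid y'_{N(s)}, x)\, \tilde{p}(y'_{N(s)} \mid x) \tag{7.5}$$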
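
where $d_a$ is the degree of factor $a$, that is, the number of variables in its domain; and
$\tilde{p}$ is the empirical distribution. This can actually be seen as an approximation to
the exact likelihood gradient as follows:

$$\begin{aligned}
\frac{\partial \ell}{\partial \theta_{ak}} &= \sum_{\Psi_a \in G} \sum_k f_{ak}(y_a, x_a) - \sum_{\Psi_a \in G} \sum_k \sum_{y'_a} f_{ak}(y'_a, x_a)\, p(y'_a \mid x) && (7.6) \\
&= \sum_{\Psi_a \in G} \sum_k f_{ak}(y_a, x_a) - \sum_{\Psi_a \in G} \sum_k \sum_{y'_a} \sum_{s \in a} f_{ak}(y'_a, x_a)\, p(y'_s \mid y'_{N(s)}, x)\, p(y'_{N(s)} \mid x) && (7.7) \\
&\approx \sum_{\Psi_a \in G} \sum_k f_{ak}(y_a, x_a) - \sum_{\Psi_a \in G} \sum_k \sum_{y'_a} \sum_{s \in a} f_{ak}(y'_a, x_a)\, p(y'_s \mid y'_{N(s)}, x)\, \tilde{p}(y'_{N(s)} \mid x), && (7.8)
\end{aligned}$$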

which is equivalent to the pseudolikelihood gradient. Thus, pseudolikelihood can be
seen as approximating the model’s neighborhood marginal $p(y_{N(s)})$ by the empirical
marginal $\tilde{p}(y_{N(s)})$. Therefore, a potential way to improve pseudolikelihood is to choose a
different approximation for the neighborhood marginal, for example, using smoothing.
Indeed, any approximate inference method may be used to generate the neighborhood
approximation. Although I find this framework potentially appealing, it allows a wide
range of learning methods, and in order to know what instance of the framework is
most profitable, it would be useful to replicate the data sparsity failures in a synthetic
domain, if indeed this is the culprit.

7.3  Other  Directions 

Other  directions  for  future  work  include: 

•  Online Updates. As I mentioned in Section 1.2, for the exact CRF likelihood,
online gradient updates have been shown to converge faster than second-order
batch updates [42, 136]. Conceptually, there seems to be no obstacle to applying
them to the piecewise likelihoods of this thesis. This is unlikely to present a
significant research challenge, but verifying that this combination works could
be practically important, because it could result in very fast training times.

•  Other applications. In order to evaluate different training methods, I focus on
a small set of benchmark NLP data sets, which are of practical interest, but in
fact they are amenable to existing training methods. But really the point is to
enable training of large, loopy CRFs which have not previously been feasible.
Examples of applications that could benefit from such CRFs include cross-document
information extraction and coreference, joint models for cascades of
NLP tasks, and scoped learning for structured models.

•  Latent Variables. Piecewise training and its variants assume fully labeled data,
but models with latent variables are of considerable interest. The naive piecewise
method is unlikely to work well with latent variables, because if the same
latent variable occurs in multiple pieces, it needs to be constrained to have the
same semantics in each piece, and the naive method does not do this. It is
possible that the connections to BP can be used to devise a method that alternates
between regular piecewise and message passing in order to handle latent
variables.

•  Max-margin  methods.  Both  likelihood-based  methods  and  max-margin  methods 
require  performing  inference  during  training,  so  it  is  natural  to  wonder  whether 
the  methods  in  this  thesis  can  be  adapted  to  loopy  max-margin  models.  A 
suggestive  step  in  this  direction  is  factorized  MIRA  [75],  in  which  the  margin 
constraints  are  required  to  hold  only  over  single  edges,  rather  than  the  entire 
prediction.  On  a  dependency  parsing  task,  this  method  had  good  accuracy, 
but  it  did  not  improve  training  time  because  the  model  had  special  structure 
that  made  it  amenable  to  exact  inference.  It  may  be  interesting  to  see  whether 
analogs of piecewise methods work well for max-margin training on loopy models.